2012-10-25 02:37:44

by Dave Jones

[permalink] [raw]
Subject: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

Machine under significant load (4gb memory used, swap usage fluctuating)
triggered this...

WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
Call Trace:
[<ffffffff8107100f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff8107106a>] warn_slowpath_null+0x1a/0x20
[<ffffffff811903fc>] shmem_getpage_gfp+0xa5c/0xa70
[<ffffffff8118fc3e>] ? shmem_getpage_gfp+0x29e/0xa70
[<ffffffff81190e4f>] shmem_fault+0x4f/0xa0
[<ffffffff8119f391>] __do_fault+0x71/0x5c0
[<ffffffff810e1ac6>] ? __lock_acquire+0x306/0x1ba0
[<ffffffff810b6ff9>] ? local_clock+0x89/0xa0
[<ffffffff811a2767>] handle_pte_fault+0x97/0xae0
[<ffffffff816d1069>] ? sub_preempt_count+0x79/0xd0
[<ffffffff8136d68e>] ? delay_tsc+0xae/0x120
[<ffffffff8136d578>] ? __const_udelay+0x28/0x30
[<ffffffff811a4a39>] handle_mm_fault+0x289/0x350
[<ffffffff816d091e>] __do_page_fault+0x18e/0x530
[<ffffffff810b6ff9>] ? local_clock+0x89/0xa0
[<ffffffff810b0e51>] ? get_parent_ip+0x11/0x50
[<ffffffff810b0e51>] ? get_parent_ip+0x11/0x50
[<ffffffff816d1069>] ? sub_preempt_count+0x79/0xd0
[<ffffffff8112d389>] ? rcu_user_exit+0xc9/0xf0
[<ffffffff816d0ceb>] do_page_fault+0x2b/0x50
[<ffffffff816cd3b8>] page_fault+0x28/0x30
[<ffffffff8136d259>] ? copy_user_enhanced_fast_string+0x9/0x20
[<ffffffff8121c181>] ? sys_futimesat+0x41/0xe0
[<ffffffff8102bf35>] ? syscall_trace_enter+0x25/0x2c0
[<ffffffff816d5625>] ? tracesys+0x7e/0xe6
[<ffffffff816d5688>] tracesys+0xe1/0xe6



1148 error = shmem_add_to_page_cache(page, mapping, index,
1149 gfp, swp_to_radix_entry(swap));
1150 /* We already confirmed swap, and make no allocation */
1151 VM_BUG_ON(error);
1152 }


total used free shared buffers cached
Mem: 3885528 2854064 1031464 0 9624 19208
-/+ buffers/cache: 2825232 1060296
Swap: 6029308 30656 5998652


2012-10-25 04:36:35

by Hugh Dickins

[permalink] [raw]
Subject: Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

On Wed, 24 Oct 2012, Dave Jones wrote:

> Machine under significant load (4gb memory used, swap usage fluctuating)
> triggered this...
>
> WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
> Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
> Call Trace:
> [<ffffffff8107100f>] warn_slowpath_common+0x7f/0xc0
> [<ffffffff8107106a>] warn_slowpath_null+0x1a/0x20
> [<ffffffff811903fc>] shmem_getpage_gfp+0xa5c/0xa70
> [<ffffffff8118fc3e>] ? shmem_getpage_gfp+0x29e/0xa70
> [<ffffffff81190e4f>] shmem_fault+0x4f/0xa0
> [<ffffffff8119f391>] __do_fault+0x71/0x5c0
> [<ffffffff810e1ac6>] ? __lock_acquire+0x306/0x1ba0
> [<ffffffff810b6ff9>] ? local_clock+0x89/0xa0
> [<ffffffff811a2767>] handle_pte_fault+0x97/0xae0
> [<ffffffff816d1069>] ? sub_preempt_count+0x79/0xd0
> [<ffffffff8136d68e>] ? delay_tsc+0xae/0x120
> [<ffffffff8136d578>] ? __const_udelay+0x28/0x30
> [<ffffffff811a4a39>] handle_mm_fault+0x289/0x350
> [<ffffffff816d091e>] __do_page_fault+0x18e/0x530
> [<ffffffff810b6ff9>] ? local_clock+0x89/0xa0
> [<ffffffff810b0e51>] ? get_parent_ip+0x11/0x50
> [<ffffffff810b0e51>] ? get_parent_ip+0x11/0x50
> [<ffffffff816d1069>] ? sub_preempt_count+0x79/0xd0
> [<ffffffff8112d389>] ? rcu_user_exit+0xc9/0xf0
> [<ffffffff816d0ceb>] do_page_fault+0x2b/0x50
> [<ffffffff816cd3b8>] page_fault+0x28/0x30
> [<ffffffff8136d259>] ? copy_user_enhanced_fast_string+0x9/0x20
> [<ffffffff8121c181>] ? sys_futimesat+0x41/0xe0
> [<ffffffff8102bf35>] ? syscall_trace_enter+0x25/0x2c0
> [<ffffffff816d5625>] ? tracesys+0x7e/0xe6
> [<ffffffff816d5688>] tracesys+0xe1/0xe6
>
>
>
> 1148 error = shmem_add_to_page_cache(page, mapping, index,
> 1149 gfp, swp_to_radix_entry(swap));
> 1150 /* We already confirmed swap, and make no allocation */
> 1151 VM_BUG_ON(error);
> 1152 }

That's very surprising. Easy enough to handle an error there, but
of course I made it a VM_BUG_ON because it violates my assumptions:
I rather need to understand how this can be, and I've no idea.

Clutching at straws, I expect this is entirely irrelevant, but:
there isn't a warning on line 1151 of mm/shmem.c in 3.7.0-rc2 nor
in current linux.git; rather, there's a VM_BUG_ON on line 1149.

So you've inserted a couple of lines for some reason (more useful
trinity behaviour, perhaps)? And have some config option I'm
unfamiliar with, that mutates a BUG_ON or VM_BUG_ON into a warning?

Hugh

>
>
> total used free shared buffers cached
> Mem: 3885528 2854064 1031464 0 9624 19208
> -/+ buffers/cache: 2825232 1060296
> Swap: 6029308 30656 5998652

2012-10-25 04:50:59

by Ni zhan Chen

[permalink] [raw]
Subject: Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

On 10/25/2012 12:36 PM, Hugh Dickins wrote:
> On Wed, 24 Oct 2012, Dave Jones wrote:
>
>> Machine under significant load (4gb memory used, swap usage fluctuating)
>> triggered this...
>>
>> WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
>> Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
>> Call Trace:
>> [<ffffffff8107100f>] warn_slowpath_common+0x7f/0xc0
>> [<ffffffff8107106a>] warn_slowpath_null+0x1a/0x20
>> [<ffffffff811903fc>] shmem_getpage_gfp+0xa5c/0xa70
>> [<ffffffff8118fc3e>] ? shmem_getpage_gfp+0x29e/0xa70
>> [<ffffffff81190e4f>] shmem_fault+0x4f/0xa0
>> [<ffffffff8119f391>] __do_fault+0x71/0x5c0
>> [<ffffffff810e1ac6>] ? __lock_acquire+0x306/0x1ba0
>> [<ffffffff810b6ff9>] ? local_clock+0x89/0xa0
>> [<ffffffff811a2767>] handle_pte_fault+0x97/0xae0
>> [<ffffffff816d1069>] ? sub_preempt_count+0x79/0xd0
>> [<ffffffff8136d68e>] ? delay_tsc+0xae/0x120
>> [<ffffffff8136d578>] ? __const_udelay+0x28/0x30
>> [<ffffffff811a4a39>] handle_mm_fault+0x289/0x350
>> [<ffffffff816d091e>] __do_page_fault+0x18e/0x530
>> [<ffffffff810b6ff9>] ? local_clock+0x89/0xa0
>> [<ffffffff810b0e51>] ? get_parent_ip+0x11/0x50
>> [<ffffffff810b0e51>] ? get_parent_ip+0x11/0x50
>> [<ffffffff816d1069>] ? sub_preempt_count+0x79/0xd0
>> [<ffffffff8112d389>] ? rcu_user_exit+0xc9/0xf0
>> [<ffffffff816d0ceb>] do_page_fault+0x2b/0x50
>> [<ffffffff816cd3b8>] page_fault+0x28/0x30
>> [<ffffffff8136d259>] ? copy_user_enhanced_fast_string+0x9/0x20
>> [<ffffffff8121c181>] ? sys_futimesat+0x41/0xe0
>> [<ffffffff8102bf35>] ? syscall_trace_enter+0x25/0x2c0
>> [<ffffffff816d5625>] ? tracesys+0x7e/0xe6
>> [<ffffffff816d5688>] tracesys+0xe1/0xe6
>>
>>
>>
>> 1148 error = shmem_add_to_page_cache(page, mapping, index,
>> 1149 gfp, swp_to_radix_entry(swap));
>> 1150 /* We already confirmed swap, and make no allocation */
>> 1151 VM_BUG_ON(error);
>> 1152 }
> That's very surprising. Easy enough to handle an error there, but
> of course I made it a VM_BUG_ON because it violates my assumptions:
> I rather need to understand how this can be, and I've no idea.
>
> Clutching at straws, I expect this is entirely irrelevant, but:
> there isn't a warning on line 1151 of mm/shmem.c in 3.7.0-rc2 nor
> in current linux.git; rather, there's a VM_BUG_ON on line 1149.
>
> So you've inserted a couple of lines for some reason (more useful
> trinity behaviour, perhaps)? And have some config option I'm
> unfamiliar with, that mutates a BUG_ON or VM_BUG_ON into a warning?

Hi Hugh,

I think it maybe caused by your commit [d189922862e03ce: shmem: fix
negative rss in memcg memory.stat], one question:

if function shmem_confirm_swap confirm the entry has already brought
back from swap by a racing thread, then why call shmem_add_to_page_cache
to add page from swapcache to pagecache again? otherwise, will goto
unlock and then go to repeat? where I miss?

Regards,
Chen

>
> Hugh
>
>>
>> total used free shared buffers cached
>> Mem: 3885528 2854064 1031464 0 9624 19208
>> -/+ buffers/cache: 2825232 1060296
>> Swap: 6029308 30656 5998652
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>

2012-10-25 06:59:51

by Hugh Dickins

[permalink] [raw]
Subject: Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

On Thu, 25 Oct 2012, Ni zhan Chen wrote:
> On 10/25/2012 12:36 PM, Hugh Dickins wrote:
> > On Wed, 24 Oct 2012, Dave Jones wrote:
> >
> > > Machine under significant load (4gb memory used, swap usage fluctuating)
> > > triggered this...
> > >
> > > WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
> > > Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
> > >
> > > 1148 error = shmem_add_to_page_cache(page,
> > > mapping, index,
> > > 1149 gfp,
> > > swp_to_radix_entry(swap));
> > > 1150 /* We already confirmed swap, and make no
> > > allocation */
> > > 1151 VM_BUG_ON(error);
> > > 1152 }
> > That's very surprising. Easy enough to handle an error there, but
> > of course I made it a VM_BUG_ON because it violates my assumptions:
> > I rather need to understand how this can be, and I've no idea.
> >
> > Clutching at straws, I expect this is entirely irrelevant, but:
> > there isn't a warning on line 1151 of mm/shmem.c in 3.7.0-rc2 nor
> > in current linux.git; rather, there's a VM_BUG_ON on line 1149.
> >
> > So you've inserted a couple of lines for some reason (more useful
> > trinity behaviour, perhaps)? And have some config option I'm
> > unfamiliar with, that mutates a BUG_ON or VM_BUG_ON into a warning?
>
> Hi Hugh,
>
> I think it maybe caused by your commit [d189922862e03ce: shmem: fix negative
> rss in memcg memory.stat], one question:

Well, yes, I added the VM_BUG_ON in that commit.

>
> if function shmem_confirm_swap confirm the entry has already brought back
> from swap by a racing thread,

The reverse: true confirms that the swap entry has not been brought back
from swap by a racing thread; false indicates that there has been a race.

> then why call shmem_add_to_page_cache to add
> page from swapcache to pagecache again?

Adding it to pagecache again, after such a race, would set error to
-EEXIST (originating from radix_tree_insert); but we don't do that,
we add it to pagecache when it has not already been added.

Or that's the intention: but Dave seems to have found an unexpected
exception, despite us holding the page lock across all this.

(But if it weren't for the memcg and replace_page issues, I'd much
prefer to let shmem_add_to_page_cache discover the race as before.)

Hugh

> otherwise, will goto unlock and then go to repeat? where I miss?
>
> Regards,
> Chen

2012-10-25 10:21:45

by Ni zhan Chen

[permalink] [raw]
Subject: Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

On 10/25/2012 02:59 PM, Hugh Dickins wrote:
> On Thu, 25 Oct 2012, Ni zhan Chen wrote:
>> On 10/25/2012 12:36 PM, Hugh Dickins wrote:
>>> On Wed, 24 Oct 2012, Dave Jones wrote:
>>>
>>>> Machine under significant load (4gb memory used, swap usage fluctuating)
>>>> triggered this...
>>>>
>>>> WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
>>>> Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
>>>>
>>>> 1148 error = shmem_add_to_page_cache(page,
>>>> mapping, index,
>>>> 1149 gfp,
>>>> swp_to_radix_entry(swap));
>>>> 1150 /* We already confirmed swap, and make no
>>>> allocation */
>>>> 1151 VM_BUG_ON(error);
>>>> 1152 }
>>> That's very surprising. Easy enough to handle an error there, but
>>> of course I made it a VM_BUG_ON because it violates my assumptions:
>>> I rather need to understand how this can be, and I've no idea.
>>>
>>> Clutching at straws, I expect this is entirely irrelevant, but:
>>> there isn't a warning on line 1151 of mm/shmem.c in 3.7.0-rc2 nor
>>> in current linux.git; rather, there's a VM_BUG_ON on line 1149.
>>>
>>> So you've inserted a couple of lines for some reason (more useful
>>> trinity behaviour, perhaps)? And have some config option I'm
>>> unfamiliar with, that mutates a BUG_ON or VM_BUG_ON into a warning?
>> Hi Hugh,
>>
>> I think it maybe caused by your commit [d189922862e03ce: shmem: fix negative
>> rss in memcg memory.stat], one question:
> Well, yes, I added the VM_BUG_ON in that commit.
>
>> if function shmem_confirm_swap confirm the entry has already brought back
>> from swap by a racing thread,
> The reverse: true confirms that the swap entry has not been brought back
> from swap by a racing thread; false indicates that there has been a race.
>
>> then why call shmem_add_to_page_cache to add
>> page from swapcache to pagecache again?
> Adding it to pagecache again, after such a race, would set error to
> -EEXIST (originating from radix_tree_insert); but we don't do that,
> we add it to pagecache when it has not already been added.
>
> Or that's the intention: but Dave seems to have found an unexpected
> exception, despite us holding the page lock across all this.
>
> (But if it weren't for the memcg and replace_page issues, I'd much
> prefer to let shmem_add_to_page_cache discover the race as before.)
>
> Hugh

Hi Hugh

Thanks for your response. You mean the -EEXIST originating from
radix_tree_insert, in radix_tree_insert:
if (slot != NULL)
return -EEXIST;
But why slot should be NULL? if no race, the pagecache related radix
tree entry should be RADIX_TREE_EXCEPTIONAL_ENTRY+swap_entry_t.val,
where I miss?

Regards,
Chen

>
>> otherwise, will goto unlock and then go to repeat? where I miss?
>>
>> Regards,
>> Chen

2012-10-25 11:14:18

by Dave Jones

[permalink] [raw]
Subject: Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

On Wed, Oct 24, 2012 at 09:36:27PM -0700, Hugh Dickins wrote:

> > 1148 error = shmem_add_to_page_cache(page, mapping, index,
> > 1149 gfp, swp_to_radix_entry(swap));
> > 1150 /* We already confirmed swap, and make no allocation */
> > 1151 VM_BUG_ON(error);
> > 1152 }
>
> That's very surprising. Easy enough to handle an error there, but
> of course I made it a VM_BUG_ON because it violates my assumptions:
> I rather need to understand how this can be, and I've no idea.
>
> Clutching at straws, I expect this is entirely irrelevant, but:
> there isn't a warning on line 1151 of mm/shmem.c in 3.7.0-rc2 nor
> in current linux.git; rather, there's a VM_BUG_ON on line 1149.
>
> So you've inserted a couple of lines for some reason (more useful
> trinity behaviour, perhaps)?

detritus from the recent mpol_to_str bug that I was chasing.
Shouldn't be relevant...

diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/mm>
--- src/git-trees/kernel/linux/mm/shmem.c 2012-10-12 10:01:46.613408580 ->
+++ linux-dj/mm/shmem.c 2012-10-15 12:31:32.979653309 -0400
@@ -885,13 +885,15 @@ redirty:
static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
{
char buffer[64];
+ int ret;

if (!mpol || mpol->mode == MPOL_DEFAULT)
return; /* show nothing */

- mpol_to_str(buffer, sizeof(buffer), mpol, 1);
-
- seq_printf(seq, ",mpol=%s", buffer);
+ memset(buffer, 0, sizeof(buffer));
+ ret = mpol_to_str(buffer, sizeof(buffer), mpol, 1);
+ if (ret > 0)
+ seq_printf(seq, ",mpol=%s", buffer);
}


> And have some config option I'm
> unfamiliar with, that mutates a BUG_ON or VM_BUG_ON into a warning?

Yes, I do have this..

-#define VM_BUG_ON(cond) BUG_ON(cond)
+#define VM_BUG_ON(cond) WARN_ON(cond)

because I got tired of things not going over my usb serial port when I hit them
a while ago. BUG_ON is pretty unfriendly to bug finding.

Dave

2012-10-25 20:52:23

by Johannes Weiner

[permalink] [raw]
Subject: Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

On Wed, Oct 24, 2012 at 09:36:27PM -0700, Hugh Dickins wrote:
> On Wed, 24 Oct 2012, Dave Jones wrote:
>
> > Machine under significant load (4gb memory used, swap usage fluctuating)
> > triggered this...
> >
> > WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
> > Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
> > Call Trace:
> > [<ffffffff8107100f>] warn_slowpath_common+0x7f/0xc0
> > [<ffffffff8107106a>] warn_slowpath_null+0x1a/0x20
> > [<ffffffff811903fc>] shmem_getpage_gfp+0xa5c/0xa70
> > [<ffffffff8118fc3e>] ? shmem_getpage_gfp+0x29e/0xa70
> > [<ffffffff81190e4f>] shmem_fault+0x4f/0xa0
> > [<ffffffff8119f391>] __do_fault+0x71/0x5c0
> > [<ffffffff810e1ac6>] ? __lock_acquire+0x306/0x1ba0
> > [<ffffffff810b6ff9>] ? local_clock+0x89/0xa0
> > [<ffffffff811a2767>] handle_pte_fault+0x97/0xae0
> > [<ffffffff816d1069>] ? sub_preempt_count+0x79/0xd0
> > [<ffffffff8136d68e>] ? delay_tsc+0xae/0x120
> > [<ffffffff8136d578>] ? __const_udelay+0x28/0x30
> > [<ffffffff811a4a39>] handle_mm_fault+0x289/0x350
> > [<ffffffff816d091e>] __do_page_fault+0x18e/0x530
> > [<ffffffff810b6ff9>] ? local_clock+0x89/0xa0
> > [<ffffffff810b0e51>] ? get_parent_ip+0x11/0x50
> > [<ffffffff810b0e51>] ? get_parent_ip+0x11/0x50
> > [<ffffffff816d1069>] ? sub_preempt_count+0x79/0xd0
> > [<ffffffff8112d389>] ? rcu_user_exit+0xc9/0xf0
> > [<ffffffff816d0ceb>] do_page_fault+0x2b/0x50
> > [<ffffffff816cd3b8>] page_fault+0x28/0x30
> > [<ffffffff8136d259>] ? copy_user_enhanced_fast_string+0x9/0x20
> > [<ffffffff8121c181>] ? sys_futimesat+0x41/0xe0
> > [<ffffffff8102bf35>] ? syscall_trace_enter+0x25/0x2c0
> > [<ffffffff816d5625>] ? tracesys+0x7e/0xe6
> > [<ffffffff816d5688>] tracesys+0xe1/0xe6
> >
> >
> >
> > 1148 error = shmem_add_to_page_cache(page, mapping, index,
> > 1149 gfp, swp_to_radix_entry(swap));
> > 1150 /* We already confirmed swap, and make no allocation */
> > 1151 VM_BUG_ON(error);
> > 1152 }
>
> That's very surprising. Easy enough to handle an error there, but
> of course I made it a VM_BUG_ON because it violates my assumptions:
> I rather need to understand how this can be, and I've no idea.

Could it be concurrent truncation clearing out the entry between
shmem_confirm_swap() and shmem_add_to_page_cache()? I don't see
anything preventing that.

The empty slot would not match the expected swap entry this call
passes in and the returned error would be -ENOENT.

2012-10-25 21:27:48

by Hugh Dickins

[permalink] [raw]
Subject: Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

On Thu, 25 Oct 2012, Ni zhan Chen wrote:
> On 10/25/2012 02:59 PM, Hugh Dickins wrote:
> > On Thu, 25 Oct 2012, Ni zhan Chen wrote:
> > >
> > > I think it maybe caused by your commit [d189922862e03ce: shmem: fix
> > > negative
> > > rss in memcg memory.stat], one question:
> > Well, yes, I added the VM_BUG_ON in that commit.
> >
> > > if function shmem_confirm_swap confirm the entry has already brought back
> > > from swap by a racing thread,
> > The reverse: true confirms that the swap entry has not been brought back
> > from swap by a racing thread; false indicates that there has been a race.
> >
> > > then why call shmem_add_to_page_cache to add
> > > page from swapcache to pagecache again?
> > Adding it to pagecache again, after such a race, would set error to
> > -EEXIST (originating from radix_tree_insert); but we don't do that,
> > we add it to pagecache when it has not already been added.
> >
> > Or that's the intention: but Dave seems to have found an unexpected
> > exception, despite us holding the page lock across all this.
> >
> > (But if it weren't for the memcg and replace_page issues, I'd much
> > prefer to let shmem_add_to_page_cache discover the race as before.)
> >
> > Hugh
>
> Hi Hugh
>
> Thanks for your response. You mean the -EEXIST originating from
> radix_tree_insert, in radix_tree_insert:
> if (slot != NULL)
> return -EEXIST;
> But why slot should be NULL? if no race, the pagecache related radix tree
> entry should be RADIX_TREE_EXCEPTIONAL_ENTRY+swap_entry_t.val, where I miss?

I was describing what would happen in a case that should not exist,
that you had thought the common case. In actuality, the entry should
not be NULL, it should be as you say there.

Hugh

2012-10-25 21:29:03

by Hugh Dickins

[permalink] [raw]
Subject: Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

On Thu, 25 Oct 2012, Dave Jones wrote:
> On Wed, Oct 24, 2012 at 09:36:27PM -0700, Hugh Dickins wrote:
>
> > Clutching at straws, I expect this is entirely irrelevant, but:
> > there isn't a warning on line 1151 of mm/shmem.c in 3.7.0-rc2 nor
> > in current linux.git; rather, there's a VM_BUG_ON on line 1149.
> >
> > So you've inserted a couple of lines for some reason (more useful
> > trinity behaviour, perhaps)?
>
> detritus from the recent mpol_to_str bug that I was chasing.
> Shouldn't be relevant...
>
> > And have some config option I'm
> > unfamiliar with, that mutates a BUG_ON or VM_BUG_ON into a warning?
>
> Yes, I do have this..
>
> -#define VM_BUG_ON(cond) BUG_ON(cond)
> +#define VM_BUG_ON(cond) WARN_ON(cond)
>
> because I got tired of things not going over my usb serial port when I hit them
> a while ago. BUG_ON is pretty unfriendly to bug finding.

Makes sense, thanks for the reassurance.

Hugh

2012-10-25 21:48:40

by Hugh Dickins

[permalink] [raw]
Subject: Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

On Thu, 25 Oct 2012, Johannes Weiner wrote:
> On Wed, Oct 24, 2012 at 09:36:27PM -0700, Hugh Dickins wrote:
> > On Wed, 24 Oct 2012, Dave Jones wrote:
> >
> > > Machine under significant load (4gb memory used, swap usage fluctuating)
> > > triggered this...
> > >
> > > WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
> > > Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
> > >
> > > 1148 error = shmem_add_to_page_cache(page, mapping, index,
> > > 1149 gfp, swp_to_radix_entry(swap));
> > > 1150 /* We already confirmed swap, and make no allocation */
> > > 1151 VM_BUG_ON(error);
> > > 1152 }
> >
> > That's very surprising. Easy enough to handle an error there, but
> > of course I made it a VM_BUG_ON because it violates my assumptions:
> > I rather need to understand how this can be, and I've no idea.
>
> Could it be concurrent truncation clearing out the entry between
> shmem_confirm_swap() and shmem_add_to_page_cache()? I don't see
> anything preventing that.
>
> The empty slot would not match the expected swap entry this call
> passes in and the returned error would be -ENOENT.

Excellent notion, many thanks Hannes, I believe you've got it.

I've hit that truncation problem in swapoff (and commented on it
in shmem_unuse_inode), but never hit it or considered it here.
I think of the page lock as holding it stable, but truncation's
free_swap_and_cache only does a trylock on the swapcache page,
so we're not secured against that possibility.

So I'd like to change it to VM_BUG_ON(error && error != -ENOENT),
but there's a little tidying up to do in the -ENOENT case, which
needs more thought. A delete_from_swap_cache(page) - though we
can be lazy and leave that to reclaim for such a rare occurrence -
and probably a mem_cgroup uncharge; but the memcg hooks are always
the hardest to get right, I'll have think about that one carefully.

Hugh

2012-10-26 01:48:30

by Ni zhan Chen

[permalink] [raw]
Subject: Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

On 10/26/2012 05:27 AM, Hugh Dickins wrote:
> On Thu, 25 Oct 2012, Ni zhan Chen wrote:
>> On 10/25/2012 02:59 PM, Hugh Dickins wrote:
>>> On Thu, 25 Oct 2012, Ni zhan Chen wrote:
>>>> I think it maybe caused by your commit [d189922862e03ce: shmem: fix
>>>> negative
>>>> rss in memcg memory.stat], one question:
>>> Well, yes, I added the VM_BUG_ON in that commit.
>>>
>>>> if function shmem_confirm_swap confirm the entry has already brought back
>>>> from swap by a racing thread,
>>> The reverse: true confirms that the swap entry has not been brought back
>>> from swap by a racing thread; false indicates that there has been a race.
>>>
>>>> then why call shmem_add_to_page_cache to add
>>>> page from swapcache to pagecache again?
>>> Adding it to pagecache again, after such a race, would set error to
>>> -EEXIST (originating from radix_tree_insert); but we don't do that,
>>> we add it to pagecache when it has not already been added.
>>>
>>> Or that's the intention: but Dave seems to have found an unexpected
>>> exception, despite us holding the page lock across all this.
>>>
>>> (But if it weren't for the memcg and replace_page issues, I'd much
>>> prefer to let shmem_add_to_page_cache discover the race as before.)
>>>
>>> Hugh
>> Hi Hugh
>>
>> Thanks for your response. You mean the -EEXIST originating from
>> radix_tree_insert, in radix_tree_insert:
>> if (slot != NULL)
>> return -EEXIST;
>> But why slot should be NULL? if no race, the pagecache related radix tree
>> entry should be RADIX_TREE_EXCEPTIONAL_ENTRY+swap_entry_t.val, where I miss?
> I was describing what would happen in a case that should not exist,
> that you had thought the common case. In actuality, the entry should
> not be NULL, it should be as you say there.

Thanks for your patience. So in the common case, the entry should be the
value I mentioned, then why has this check?
if (slot != NULL)
return -EEXIST;

the common case will return -EEXIST.

>
> Hugh
>

2012-10-26 02:15:19

by Ni zhan Chen

[permalink] [raw]
Subject: Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

On 10/26/2012 05:48 AM, Hugh Dickins wrote:
> On Thu, 25 Oct 2012, Johannes Weiner wrote:
>> On Wed, Oct 24, 2012 at 09:36:27PM -0700, Hugh Dickins wrote:
>>> On Wed, 24 Oct 2012, Dave Jones wrote:
>>>
>>>> Machine under significant load (4gb memory used, swap usage fluctuating)
>>>> triggered this...
>>>>
>>>> WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
>>>> Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
>>>>
>>>> 1148 error = shmem_add_to_page_cache(page, mapping, index,
>>>> 1149 gfp, swp_to_radix_entry(swap));
>>>> 1150 /* We already confirmed swap, and make no allocation */
>>>> 1151 VM_BUG_ON(error);
>>>> 1152 }
>>> That's very surprising. Easy enough to handle an error there, but
>>> of course I made it a VM_BUG_ON because it violates my assumptions:
>>> I rather need to understand how this can be, and I've no idea.
>> Could it be concurrent truncation clearing out the entry between
>> shmem_confirm_swap() and shmem_add_to_page_cache()? I don't see
>> anything preventing that.
>>
>> The empty slot would not match the expected swap entry this call
>> passes in and the returned error would be -ENOENT.
> Excellent notion, many thanks Hannes, I believe you've got it.
>
> I've hit that truncation problem in swapoff (and commented on it
> in shmem_unuse_inode), but never hit it or considered it here.
> I think of the page lock as holding it stable, but truncation's
> free_swap_and_cache only does a trylock on the swapcache page,
> so we're not secured against that possibility.

Hi Hugh,

Even though free_swap_and_cache only does a trylock on the swapcache
page, but it doens't call delete_from_swap_cache and the associated
entry should still be there, I am interested in what you have already
introduce to protect it?

>
> So I'd like to change it to VM_BUG_ON(error && error != -ENOENT),
> but there's a little tidying up to do in the -ENOENT case, which

Do you mean radix_tree_insert will return -ENOENT if the associated
entry is not present? Why I can't find this return value in the function
radix_tree_insert?

> needs more thought. A delete_from_swap_cache(page) - though we
> can be lazy and leave that to reclaim for such a rare occurrence -
> and probably a mem_cgroup uncharge; but the memcg hooks are always
> the hardest to get right, I'll have think about that one carefully.
>
> Hugh
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>

2012-11-01 19:11:00

by Dave Jones

[permalink] [raw]
Subject: Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

On Wed, Oct 24, 2012 at 09:36:27PM -0700, Hugh Dickins wrote:
> On Wed, 24 Oct 2012, Dave Jones wrote:
>
> > Machine under significant load (4gb memory used, swap usage fluctuating)
> > triggered this...
> >
> > WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
> > Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
> > Call Trace:
> > [<ffffffff8107100f>] warn_slowpath_common+0x7f/0xc0
> > [<ffffffff8107106a>] warn_slowpath_null+0x1a/0x20
> > [<ffffffff811903fc>] shmem_getpage_gfp+0xa5c/0xa70
> > [<ffffffff8118fc3e>] ? shmem_getpage_gfp+0x29e/0xa70
> > [<ffffffff81190e4f>] shmem_fault+0x4f/0xa0
> > [<ffffffff8119f391>] __do_fault+0x71/0x5c0
> > [<ffffffff810e1ac6>] ? __lock_acquire+0x306/0x1ba0
> > [<ffffffff810b6ff9>] ? local_clock+0x89/0xa0
> > [<ffffffff811a2767>] handle_pte_fault+0x97/0xae0
> > [<ffffffff816d1069>] ? sub_preempt_count+0x79/0xd0
> > [<ffffffff8136d68e>] ? delay_tsc+0xae/0x120
> > [<ffffffff8136d578>] ? __const_udelay+0x28/0x30
> > [<ffffffff811a4a39>] handle_mm_fault+0x289/0x350
> > [<ffffffff816d091e>] __do_page_fault+0x18e/0x530
> > [<ffffffff810b6ff9>] ? local_clock+0x89/0xa0
> > [<ffffffff810b0e51>] ? get_parent_ip+0x11/0x50
> > [<ffffffff810b0e51>] ? get_parent_ip+0x11/0x50
> > [<ffffffff816d1069>] ? sub_preempt_count+0x79/0xd0
> > [<ffffffff8112d389>] ? rcu_user_exit+0xc9/0xf0
> > [<ffffffff816d0ceb>] do_page_fault+0x2b/0x50
> > [<ffffffff816cd3b8>] page_fault+0x28/0x30
> > [<ffffffff8136d259>] ? copy_user_enhanced_fast_string+0x9/0x20
> > [<ffffffff8121c181>] ? sys_futimesat+0x41/0xe0
> > [<ffffffff8102bf35>] ? syscall_trace_enter+0x25/0x2c0
> > [<ffffffff816d5625>] ? tracesys+0x7e/0xe6
> > [<ffffffff816d5688>] tracesys+0xe1/0xe6
> >
> >
> >
> > 1148 error = shmem_add_to_page_cache(page, mapping, index,
> > 1149 gfp, swp_to_radix_entry(swap));
> > 1150 /* We already confirmed swap, and make no allocation */
> > 1151 VM_BUG_ON(error);
> > 1152 }
>
> That's very surprising. Easy enough to handle an error there, but
> of course I made it a VM_BUG_ON because it violates my assumptions:
> I rather need to understand how this can be, and I've no idea.

I just noticed we had a user report hitting this same warning, but
with a different trace..

: [<ffffffff8105b84f>] warn_slowpath_common+0x7f/0xc0
: [<ffffffff8105b8aa>] warn_slowpath_null+0x1a/0x20
: [<ffffffff81143c73>] shmem_getpage_gfp+0x7f3/0x830
: [<ffffffff81158c9d>] ? vma_adjust+0x3ed/0x620
: [<ffffffff81143f02>] shmem_file_aio_read+0x1f2/0x380
: [<ffffffff8118e487>] do_sync_read+0xa7/0xe0
: [<ffffffff8118eda9>] vfs_read+0xa9/0x180
: [<ffffffff8118eeca>] sys_read+0x4a/0x90
: [<ffffffff816226e9>] system_call_fastpath+0x16/0x1b

Dave

2012-11-01 23:03:48

by Hugh Dickins

[permalink] [raw]
Subject: Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

On Thu, 1 Nov 2012, Dave Jones wrote:
> On Wed, Oct 24, 2012 at 09:36:27PM -0700, Hugh Dickins wrote:
> > On Wed, 24 Oct 2012, Dave Jones wrote:
> >
> > > Machine under significant load (4gb memory used, swap usage fluctuating)
> > > triggered this...
> > >
> > > WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
> > >
> > > 1148 error = shmem_add_to_page_cache(page, mapping, index,
> > > 1149 gfp, swp_to_radix_entry(swap));
> > > 1150 /* We already confirmed swap, and make no allocation */
> > > 1151 VM_BUG_ON(error);
> > > 1152 }
> >
> > That's very surprising. Easy enough to handle an error there, but
> > of course I made it a VM_BUG_ON because it violates my assumptions:
> > I rather need to understand how this can be, and I've no idea.
>
> I just noticed we had a user report hitting this same warning, but
> with a different trace..
>
> : [<ffffffff8105b84f>] warn_slowpath_common+0x7f/0xc0
> : [<ffffffff8105b8aa>] warn_slowpath_null+0x1a/0x20
> : [<ffffffff81143c73>] shmem_getpage_gfp+0x7f3/0x830
> : [<ffffffff81158c9d>] ? vma_adjust+0x3ed/0x620
> : [<ffffffff81143f02>] shmem_file_aio_read+0x1f2/0x380
> : [<ffffffff8118e487>] do_sync_read+0xa7/0xe0
> : [<ffffffff8118eda9>] vfs_read+0xa9/0x180
> : [<ffffffff8118eeca>] sys_read+0x4a/0x90
> : [<ffffffff816226e9>] system_call_fastpath+0x16/0x1b

Equally explicable by Hannes's hypothesis;
but useful supporting evidence, thank you.

Except... earlier in the thread you explained how you hacked
#define VM_BUG_ON(cond) WARN_ON(cond)
to get this to come out as a warning instead of a bug,
and now it looks as if "a user" has here done the same.

Which is very much a user's right, of course; but does
make me wonder whether that user might actually be davej ;)

Never mind, whatever, it's more justification for the fix - which
I've honestly not forgotten, but somehow not got around to sending
(with a couple of others even longer outstanding). On its way
shortly, for some unpredictable value of shortly.

Hugh

2012-11-01 23:20:39

by Dave Jones

[permalink] [raw]
Subject: Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

On Thu, Nov 01, 2012 at 04:03:40PM -0700, Hugh Dickins wrote:
> > I just noticed we had a user report hitting this same warning, but
> > with a different trace..
> >
> > : [<ffffffff8105b84f>] warn_slowpath_common+0x7f/0xc0
> > : [<ffffffff8105b8aa>] warn_slowpath_null+0x1a/0x20
> > : [<ffffffff81143c73>] shmem_getpage_gfp+0x7f3/0x830
> > : [<ffffffff81158c9d>] ? vma_adjust+0x3ed/0x620
> > : [<ffffffff81143f02>] shmem_file_aio_read+0x1f2/0x380
> > : [<ffffffff8118e487>] do_sync_read+0xa7/0xe0
> > : [<ffffffff8118eda9>] vfs_read+0xa9/0x180
> > : [<ffffffff8118eeca>] sys_read+0x4a/0x90
> > : [<ffffffff816226e9>] system_call_fastpath+0x16/0x1b
>
> Equally explicable by Hannes's hypothesis;
> but useful supporting evidence, thank you.
>
> Except... earlier in the thread you explained how you hacked
> #define VM_BUG_ON(cond) WARN_ON(cond)
> to get this to come out as a warning instead of a bug,
> and now it looks as if "a user" has here done the same.
>
> Which is very much a user's right, of course; but does
> make me wonder whether that user might actually be davej ;)

indirectly. I made the same change in the Fedora kernel a while ago
to test a hypothesis that we weren't getting any VM_BUG_ON reports.

Dave

2012-11-01 23:48:47

by Hugh Dickins

[permalink] [raw]
Subject: Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

On Thu, 1 Nov 2012, Dave Jones wrote:
> On Thu, Nov 01, 2012 at 04:03:40PM -0700, Hugh Dickins wrote:
> >
> > Except... earlier in the thread you explained how you hacked
> > #define VM_BUG_ON(cond) WARN_ON(cond)
> > to get this to come out as a warning instead of a bug,
> > and now it looks as if "a user" has here done the same.
> >
> > Which is very much a user's right, of course; but does
> > make me wonder whether that user might actually be davej ;)
>
> indirectly. I made the same change in the Fedora kernel a while ago
> to test a hypothesis that we weren't getting any VM_BUG_ON reports.

Fedora turns on CONFIG_DEBUG_VM?

All mm developers should thank you for the wider testing exposure;
but I'm not so sure that Fedora users should thank you for turning
it on - really it's for mm developers to wrap around !assertions or
more expensive checks (e.g. checking calls) in their development.

Or did I read a few months ago that some change had been made to
such definitions, and VM_BUG_ON(contents) are evaluated even when
the config option is off? I do hope I'm mistaken on that.

Hugh

2012-11-02 01:43:44

by Dave Jones

[permalink] [raw]
Subject: Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

On Thu, Nov 01, 2012 at 04:48:41PM -0700, Hugh Dickins wrote:
> On Thu, 1 Nov 2012, Dave Jones wrote:
> > On Thu, Nov 01, 2012 at 04:03:40PM -0700, Hugh Dickins wrote:
> > >
> > > Except... earlier in the thread you explained how you hacked
> > > #define VM_BUG_ON(cond) WARN_ON(cond)
> > > to get this to come out as a warning instead of a bug,
> > > and now it looks as if "a user" has here done the same.
> > >
> > > Which is very much a user's right, of course; but does
> > > make me wonder whether that user might actually be davej ;)
> >
> > indirectly. I made the same change in the Fedora kernel a while ago
> > to test a hypothesis that we weren't getting any VM_BUG_ON reports.
>
> Fedora turns on CONFIG_DEBUG_VM?

Yes.

> All mm developers should thank you for the wider testing exposure;
> but I'm not so sure that Fedora users should thank you for turning
> it on - really it's for mm developers to wrap around !assertions or
> more expensive checks (e.g. checking calls) in their development.

The last time I did some benchmarking the impact wasn't as ridiculous
as say lockdep, or spinlock debug. Maybe the benchmarks I was using
weren't pushing the VM very hard, but it seemed to me that the value
in getting info in potential problems early was higher than a small
performance increase.

> Or did I read a few months ago that some change had been made to
> such definitions, and VM_BUG_ON(contents) are evaluated even when
> the config option is off? I do hope I'm mistaken on that.

Pretty sure that isn't the case. I remember Andrew chastising people
a few times for putting checks in VM_BUG_ON's that needed to stay around
even when the config option was off. Perhaps you were thinking of one
of those incidents ?

Dave

2012-11-02 23:26:02

by Hugh Dickins

[permalink] [raw]
Subject: Re: shmem_getpage_gfp VM_BUG_ON triggered. [3.7rc2]

On Thu, 1 Nov 2012, Dave Jones wrote:
> On Thu, Nov 01, 2012 at 04:48:41PM -0700, Hugh Dickins wrote:
> >
> > Fedora turns on CONFIG_DEBUG_VM?
>
> Yes.
>
> > All mm developers should thank you for the wider testing exposure;
> > but I'm not so sure that Fedora users should thank you for turning
> > it on - really it's for mm developers to wrap around !assertions or
> > more expensive checks (e.g. checking calls) in their development.
>
> The last time I did some benchmarking the impact wasn't as ridiculous
> as say lockdep, or spinlock debug.

I think you're safe to assume that (outside of an individual developer's
private tree) it will never be nearly as heavy as lockdep or debug
pagealloc. I hadn't thought of spinlock debug as a heavy one, but
yes, I guess it would be heavier than almost all VM_BUG_ON()s.

> Maybe the benchmarks I was using
> weren't pushing the VM very hard, but it seemed to me that the value
> in getting info in potential problems early was higher than a small
> performance increase.

We thank you. I may have been over-estimating how much we put inside
those VM_BUG_ON()s, sorry. Just so long as you're aware that there's
a danger that one day we might slip something heavier in there.

Those few explicit #ifdef CONFIG_DEBUG_VMs sometimes found in mm/
are probably the worst: you might want to check on the current crop.

>
> > Or did I read a few months ago that some change had been made to
> > such definitions, and VM_BUG_ON(contents) are evaluated even when
> > the config option is off? I do hope I'm mistaken on that.
>
> Pretty sure that isn't the case. I remember Andrew chastising people
> a few times for putting checks in VM_BUG_ON's that needed to stay around
> even when the config option was off. Perhaps you were thinking of one
> of those incidents ?

Avoiding side-effects in BUG_ON and VM_BUG_ON. Yes, that comes up
from time to time, and I'm a believer on that. I think the discussion
I'm mis/remembering sprung out of one of those: someone was surprised
by the disassembly they found when it was configured off.

The correct answer is to try it for myself and see. Not today.

Hugh

2012-11-06 01:32:48

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

Fuzzing with trinity hit the "impossible" VM_BUG_ON(error)
(which Fedora has converted to WARNING) in shmem_getpage_gfp():

WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
Call Trace:
[<ffffffff8107100f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff8107106a>] warn_slowpath_null+0x1a/0x20
[<ffffffff811903fc>] shmem_getpage_gfp+0xa5c/0xa70
[<ffffffff81190e4f>] shmem_fault+0x4f/0xa0
[<ffffffff8119f391>] __do_fault+0x71/0x5c0
[<ffffffff811a2767>] handle_pte_fault+0x97/0xae0
[<ffffffff811a4a39>] handle_mm_fault+0x289/0x350
[<ffffffff816d091e>] __do_page_fault+0x18e/0x530
[<ffffffff816d0ceb>] do_page_fault+0x2b/0x50
[<ffffffff816cd3b8>] page_fault+0x28/0x30
[<ffffffff816d5688>] tracesys+0xe1/0xe6

Thanks to Johannes for pointing to truncation: free_swap_and_cache()
only does a trylock on the page, so the page lock we've held since
before confirming swap is not enough to protect against truncation.

What cleanup is needed in this case? Just delete_from_swap_cache(),
which takes care of the memcg uncharge.

Reported-by: Dave Jones <[email protected]>
Hypothesis-by: Johannes Weiner <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected]
---

mm/shmem.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

--- 3.7-rc4/mm/shmem.c 2012-10-14 16:16:58.361309122 -0700
+++ linux/mm/shmem.c 2012-11-01 14:31:04.288185742 -0700
@@ -1145,8 +1145,22 @@ repeat:
if (!error) {
error = shmem_add_to_page_cache(page, mapping, index,
gfp, swp_to_radix_entry(swap));
- /* We already confirmed swap, and make no allocation */
- VM_BUG_ON(error);
+ /*
+ * We already confirmed swap under page lock, and make
+ * no memory allocation here, so usually no possibility
+ * of error; but free_swap_and_cache() only trylocks a
+ * page, so it is just possible that the entry has been
+ * truncated or holepunched since swap was confirmed.
+ * shmem_undo_range() will have done some of the
+ * unaccounting, now delete_from_swap_cache() will do
+ * the rest (including mem_cgroup_uncharge_swapcache).
+ * Reset swap.val? No, leave it so "failed" goes back to
+ * "repeat": reading a hole and writing should succeed.
+ */
+ if (error) {
+ VM_BUG_ON(error != -ENOENT);
+ delete_from_swap_cache(page);
+ }
}
if (error)
goto failed;

2012-11-06 13:54:13

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

On Mon, Nov 05, 2012 at 05:32:41PM -0800, Hugh Dickins wrote:

> - /* We already confirmed swap, and make no allocation */
> - VM_BUG_ON(error);
> + /*
> + * We already confirmed swap under page lock, and make
> + * no memory allocation here, so usually no possibility
> + * of error; but free_swap_and_cache() only trylocks a
> + * page, so it is just possible that the entry has been
> + * truncated or holepunched since swap was confirmed.
> + * shmem_undo_range() will have done some of the
> + * unaccounting, now delete_from_swap_cache() will do
> + * the rest (including mem_cgroup_uncharge_swapcache).
> + * Reset swap.val? No, leave it so "failed" goes back to
> + * "repeat": reading a hole and writing should succeed.
> + */
> + if (error) {
> + VM_BUG_ON(error != -ENOENT);
> + delete_from_swap_cache(page);
> + }
> }

I ran with this overnight, and still hit the (new!) VM_BUG_ON

Perhaps we should print out what 'error' was too ? I'll rebuild with that..

------------[ cut here ]------------
WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
Hardware name: 2012 Client Platform
Modules linked in: fuse ipt_ULOG scsi_transport_iscsi binfmt_misc dn_rtmsg nfnetlink nfc caif_socket caif af_802154 phonet af_rxrpc can llc2 pppoe pppox ppp_generic slhc irda crc_ccitt rds af_key decnet rose x25 atm netrom appletalk ipx p8023 p8022 psnap llc ax25 lockd sunrpc bluetooth rfkill ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables gspca_ov519 gspca_main videodev kvm_intel usb_debug kvm crc32c_intel ghash_clmulni_intel microcode pcspkr i2c_i801 e1000e uinput i915 video i2c_algo_bit drm_kms_helper drm i2c_core
Pid: 21798, comm: trinity-child4 Not tainted 3.7.0-rc4+ #54
Call Trace:
[<ffffffff8107105f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff810710ba>] warn_slowpath_null+0x1a/0x20
[<ffffffff8119050c>] shmem_getpage_gfp+0xa5c/0xa70
[<ffffffff8118fd4e>] ? shmem_getpage_gfp+0x29e/0xa70
[<ffffffff81190f5f>] shmem_fault+0x4f/0xa0
[<ffffffff8119f4a1>] __do_fault+0x71/0x5c0
[<ffffffff810e1b26>] ? __lock_acquire+0x306/0x1ba0
[<ffffffff810b7059>] ? local_clock+0x89/0xa0
[<ffffffff811a2877>] handle_pte_fault+0x97/0xae0
[<ffffffff816d1969>] ? sub_preempt_count+0x79/0xd0
[<ffffffff8136dbfe>] ? delay_tsc+0xae/0x120
[<ffffffff8136dae8>] ? __const_udelay+0x28/0x30
[<ffffffff811a4b49>] handle_mm_fault+0x289/0x350
[<ffffffff816d121e>] __do_page_fault+0x18e/0x530
[<ffffffff811f2e70>] ? getname_flags.part.32+0x30/0x150
[<ffffffff811f2e70>] ? getname_flags.part.32+0x30/0x150
[<ffffffff811c9c5c>] ? set_track+0x8c/0x1a0
[<ffffffff816c3dd8>] ? __slab_alloc+0x531/0x59e
[<ffffffff810e471d>] ? trace_hardirqs_on_caller+0x15d/0x1e0
[<ffffffff8112d469>] ? rcu_user_exit+0xc9/0xf0
[<ffffffff816d15eb>] do_page_fault+0x2b/0x50
[<ffffffff816cdcb8>] page_fault+0x28/0x30
[<ffffffff81388e2c>] ? strncpy_from_user+0x6c/0x120
[<ffffffff811f2ec6>] getname_flags.part.32+0x86/0x150
[<ffffffff811f2fca>] getname+0x3a/0x60
[<ffffffff811f7aa4>] sys_symlinkat+0x24/0x90
[<ffffffff816d5f25>] ? tracesys+0x7e/0xe6
[<ffffffff816d5f88>] tracesys+0xe1/0xe6
---[ end trace 4ba438264ea16e97 ]---



Dave

2012-11-06 23:48:26

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

On Tue, 6 Nov 2012, Dave Jones wrote:
> On Mon, Nov 05, 2012 at 05:32:41PM -0800, Hugh Dickins wrote:
>
> > - /* We already confirmed swap, and make no allocation */
> > - VM_BUG_ON(error);
> > + /*
> > + * We already confirmed swap under page lock, and make
> > + * no memory allocation here, so usually no possibility
> > + * of error; but free_swap_and_cache() only trylocks a
> > + * page, so it is just possible that the entry has been
> > + * truncated or holepunched since swap was confirmed.
> > + * shmem_undo_range() will have done some of the
> > + * unaccounting, now delete_from_swap_cache() will do
> > + * the rest (including mem_cgroup_uncharge_swapcache).
> > + * Reset swap.val? No, leave it so "failed" goes back to
> > + * "repeat": reading a hole and writing should succeed.
> > + */
> > + if (error) {
> > + VM_BUG_ON(error != -ENOENT);
> > + delete_from_swap_cache(page);
> > + }
> > }
>
> I ran with this overnight,

Thanks a lot...

> and still hit the (new!) VM_BUG_ON

... but that's even more surprising than your original report.

>
> Perhaps we should print out what 'error' was too ? I'll rebuild with that..

Thanks; though I thought the error was going to turn out too boring,
and was preparing a debug patch for you to show the expected and found
values too. But then got very puzzled...

>
> ------------[ cut here ]------------
> WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
> Hardware name: 2012 Client Platform
> Pid: 21798, comm: trinity-child4 Not tainted 3.7.0-rc4+ #54

That's the very same line number as in your original report, despite
the long comment which the patch adds. Are you sure that kernel was
built with the patch in?

I wouldn't usually question you, but I'm going mad trying to understand
how the VM_BUG_ON(error != -ENOENT) fires. At the time I wrote that
line, and when I was preparing the debug patch, I was thinking that an
error from shmem_radix_tree_replace could also be -EEXIST, for when a
different something rather than nothing is found [*]. But that's not
the case, shmem_radix_tree_replace returns either 0 or -ENOENT.

So if error != -ENOENT, that means shmem_add_to_page_cache went the
radix_tree_insert route instead of the shmem_radix_tree_replace route;
which means that its 'expected' is NULL, so swp_to_radix_entry(swap)
is NULL; but swp_to_radix_entry() does an "| 2", so however corrupt
the radix_tree might be, I do not understand the new VM_BUG_ON firing.

Please tell me it was the wrong kernel!
Hugh

[*] But in thinking it over, I realize that if shmem_radix_tree_replace
had returned -EEXIST for the "wrong something" case, I would have been
wrong to BUG on that; because just as truncation could remove an entry,
something else could immediately after instantiate a new page there.

So although I believe my VM_BUG_ON(error != -ENOENT) is safe, it's
not saying what I had intended to say with it, and would have been
wrong to say that anyway. It just looks stupid to me now, rather
like inserting a VM_BUG_ON(false) - but that does become interesting
when you report that you've hit it.

2012-11-07 22:38:40

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

On Tue, Nov 06, 2012 at 03:48:20PM -0800, Hugh Dickins wrote:

> > ------------[ cut here ]------------
> > WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
> > Hardware name: 2012 Client Platform
> > Pid: 21798, comm: trinity-child4 Not tainted 3.7.0-rc4+ #54
>
> That's the very same line number as in your original report, despite
> the long comment which the patch adds. Are you sure that kernel was
> built with the patch in?

I just changed the code by hand, and opted not to paste the comment in.

It is plausible that I built that kernel and forgot to reboot into it,
but I'm 99.9% sure that that wasn't the case.

Unfortunatly I can't check immediately, as that machine for reasons
unknown no longer wants to get past the BIOS POST check.

I'll see if I can reproduce it on a different test box until I get
that one back up.

Dave

2012-11-14 01:36:42

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON fix

We're still hoping to hear back from Dave Jones: but either way,
please fold this patch into the earlier fix for 3.7 and -stable.

Remove its VM_BUG_ON: because either it's as I believe, a tautology
which cannot happen, and does not assert what I'd intended when I put
it in, and would even be wrong if it did (a non-NULL entry can validly
materialize there); or Dave actually hit it on his updated kernel,
in which case more research will be needed, but for upstream we
do not want a user to BUG there.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/shmem.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

--- mmotm/mm/shmem.c 2012-11-09 09:43:46.908046342 -0800
+++ linux/mm/shmem.c 2012-11-13 17:16:38.532528959 -0800
@@ -1158,10 +1158,8 @@ repeat:
* Reset swap.val? No, leave it so "failed" goes back to
* "repeat": reading a hole and writing should succeed.
*/
- if (error) {
- VM_BUG_ON(error != -ENOENT);
+ if (error)
delete_from_swap_cache(page);
- }
}
if (error)
goto failed;

2012-11-14 03:07:18

by Jaegeuk Hanse

[permalink] [raw]
Subject: Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

On 11/07/2012 07:48 AM, Hugh Dickins wrote:
> On Tue, 6 Nov 2012, Dave Jones wrote:
>> On Mon, Nov 05, 2012 at 05:32:41PM -0800, Hugh Dickins wrote:
>>
>> > - /* We already confirmed swap, and make no allocation */
>> > - VM_BUG_ON(error);
>> > + /*
>> > + * We already confirmed swap under page lock, and make
>> > + * no memory allocation here, so usually no possibility
>> > + * of error; but free_swap_and_cache() only trylocks a
>> > + * page, so it is just possible that the entry has been
>> > + * truncated or holepunched since swap was confirmed.
>> > + * shmem_undo_range() will have done some of the
>> > + * unaccounting, now delete_from_swap_cache() will do
>> > + * the rest (including mem_cgroup_uncharge_swapcache).
>> > + * Reset swap.val? No, leave it so "failed" goes back to
>> > + * "repeat": reading a hole and writing should succeed.
>> > + */
>> > + if (error) {
>> > + VM_BUG_ON(error != -ENOENT);
>> > + delete_from_swap_cache(page);
>> > + }
>> > }
>>
>> I ran with this overnight,
> Thanks a lot...
>
>> and still hit the (new!) VM_BUG_ON
> ... but that's even more surprising than your original report.
>
>> Perhaps we should print out what 'error' was too ? I'll rebuild with that..
> Thanks; though I thought the error was going to turn out too boring,
> and was preparing a debug patch for you to show the expected and found
> values too. But then got very puzzled...
>
>> ------------[ cut here ]------------
>> WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
>> Hardware name: 2012 Client Platform
>> Pid: 21798, comm: trinity-child4 Not tainted 3.7.0-rc4+ #54
> That's the very same line number as in your original report, despite
> the long comment which the patch adds. Are you sure that kernel was
> built with the patch in?
>
> I wouldn't usually question you, but I'm going mad trying to understand
> how the VM_BUG_ON(error != -ENOENT) fires. At the time I wrote that
> line, and when I was preparing the debug patch, I was thinking that an
> error from shmem_radix_tree_replace could also be -EEXIST, for when a
> different something rather than nothing is found [*]. But that's not
> the case, shmem_radix_tree_replace returns either 0 or -ENOENT.
>
> So if error != -ENOENT, that means shmem_add_to_page_cache went the
> radix_tree_insert route instead of the shmem_radix_tree_replace route;
> which means that its 'expected' is NULL, so swp_to_radix_entry(swap)
> is NULL; but swp_to_radix_entry() does an "| 2", so however corrupt
> the radix_tree might be, I do not understand the new VM_BUG_ON firing.
>
> Please tell me it was the wrong kernel!
> Hugh
>
> [*] But in thinking it over, I realize that if shmem_radix_tree_replace
> had returned -EEXIST for the "wrong something" case, I would have been
> wrong to BUG on that; because just as truncation could remove an entry,
> something else could immediately after instantiate a new page there.

Hi Hugh,

As you said, swp_to_radix_entry() does an "| 2", so even if truncation
could remove an entry and something else could immediately after
instantiate a new page there, but the expected parameter will not be
NULL, the result is radix_tree_insert will not be called and
shmem_add_to_page_cache will not return -EEXIST, then why trigger BUG_ON ?

Regards,
Jaegeuk

> So although I believe my VM_BUG_ON(error != -ENOENT) is safe, it's
> not saying what I had intended to say with it, and would have been
> wrong to say that anyway. It just looks stupid to me now, rather
> like inserting a VM_BUG_ON(false) - but that does become interesting
> when you report that you've hit it.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2012-11-14 03:50:35

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

On Wed, 14 Nov 2012, Jaegeuk Hanse wrote:
> On 11/07/2012 07:48 AM, Hugh Dickins wrote:
> > On Tue, 6 Nov 2012, Dave Jones wrote:
> > > On Mon, Nov 05, 2012 at 05:32:41PM -0800, Hugh Dickins wrote:
> > >
> > > > - /* We already confirmed swap, and make no
> > > allocation */
> > > > - VM_BUG_ON(error);
> > > > + /*
> > > > + * We already confirmed swap under page lock,
> > > and make
> > > > + * no memory allocation here, so usually no
> > > possibility
> > > > + * of error; but free_swap_and_cache() only
> > > trylocks a
> > > > + * page, so it is just possible that the
> > > entry has been
> > > > + * truncated or holepunched since swap was
> > > confirmed.
> > > > + * shmem_undo_range() will have done some of
> > > the
> > > > + * unaccounting, now delete_from_swap_cache()
> > > will do
> > > > + * the rest (including
> > > mem_cgroup_uncharge_swapcache).
> > > > + * Reset swap.val? No, leave it so "failed"
> > > goes back to
> > > > + * "repeat": reading a hole and writing
> > > should succeed.
> > > > + */
> > > > + if (error) {
> > > > + VM_BUG_ON(error != -ENOENT);
> > > > + delete_from_swap_cache(page);
> > > > + }
> > > > }
> > >
> > > I ran with this overnight,
> > Thanks a lot...
> >
> > > and still hit the (new!) VM_BUG_ON
> > ... but that's even more surprising than your original report.
> >
> > > Perhaps we should print out what 'error' was too ? I'll rebuild with
> > > that..
> > Thanks; though I thought the error was going to turn out too boring,
> > and was preparing a debug patch for you to show the expected and found
> > values too. But then got very puzzled...
> >
> > > ------------[ cut here ]------------
> > > WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
> > > Hardware name: 2012 Client Platform
> > > Pid: 21798, comm: trinity-child4 Not tainted 3.7.0-rc4+ #54
> > That's the very same line number as in your original report, despite
> > the long comment which the patch adds. Are you sure that kernel was
> > built with the patch in?
> >
> > I wouldn't usually question you, but I'm going mad trying to understand
> > how the VM_BUG_ON(error != -ENOENT) fires. At the time I wrote that
> > line, and when I was preparing the debug patch, I was thinking that an
> > error from shmem_radix_tree_replace could also be -EEXIST, for when a
> > different something rather than nothing is found [*]. But that's not
> > the case, shmem_radix_tree_replace returns either 0 or -ENOENT.
> >
> > So if error != -ENOENT, that means shmem_add_to_page_cache went the
> > radix_tree_insert route instead of the shmem_radix_tree_replace route;
> > which means that its 'expected' is NULL, so swp_to_radix_entry(swap)
> > is NULL; but swp_to_radix_entry() does an "| 2", so however corrupt
> > the radix_tree might be, I do not understand the new VM_BUG_ON firing.
> >
> > Please tell me it was the wrong kernel!
> > Hugh
> >
> > [*] But in thinking it over, I realize that if shmem_radix_tree_replace
> > had returned -EEXIST for the "wrong something" case, I would have been
> > wrong to BUG on that; because just as truncation could remove an entry,
> > something else could immediately after instantiate a new page there.
>
> Hi Hugh,
>
> As you said, swp_to_radix_entry() does an "| 2", so even if truncation could
> remove an entry and something else could immediately after instantiate a new
> page there, but the expected parameter will not be NULL, the result is
> radix_tree_insert will not be called and shmem_add_to_page_cache will not
> return -EEXIST, then why trigger BUG_ON ?

Why insert the VM_BUG_ON? Because at the time I thought that it
asserted something useful; but I was mistaken, as explained above.

How can the VM_BUG_ON trigger (without stack corruption, or something
of that kind)? I have no idea.

We are in agreement: I now think that VM_BUG_ON is misleading and silly,
and sent Andrew a further patch to remove it a just couple of hours ago.

Originally I was waiting to hear further from Dave; but his test
machine was giving trouble, and it occurred to me that, never mind
whether he says he has hit it again, or he has not hit it again,
the answer is the same: don't send that VM_BUG_ON upstream.

Hugh

>
> Regards,
> Jaegeuk
>
> > So although I believe my VM_BUG_ON(error != -ENOENT) is safe, it's
> > not saying what I had intended to say with it, and would have been
> > wrong to say that anyway. It just looks stupid to me now, rather
> > like inserting a VM_BUG_ON(false) - but that does become interesting
> > when you report that you've hit it.

2012-11-14 06:14:50

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

On Tue, Nov 13, 2012 at 07:50:25PM -0800, Hugh Dickins wrote:

> Originally I was waiting to hear further from Dave; but his test
> machine was giving trouble, and it occurred to me that, never mind
> whether he says he has hit it again, or he has not hit it again,
> the answer is the same: don't send that VM_BUG_ON upstream.

Sorry, I'm supposedly on vacation.
That said, a replacement test box has been running tests since last Friday
without hitting that case. Maybe it was the last death throes of
that other machine before it gave up the ghost completely.

Does sound like an awful coincidence though.

Dave

2012-11-14 10:06:06

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

On Wed, 14 Nov 2012, Dave Jones wrote:
> On Tue, Nov 13, 2012 at 07:50:25PM -0800, Hugh Dickins wrote:
>
> > Originally I was waiting to hear further from Dave; but his test
> > machine was giving trouble, and it occurred to me that, never mind
> > whether he says he has hit it again, or he has not hit it again,
> > the answer is the same: don't send that VM_BUG_ON upstream.
>
> Sorry, I'm supposedly on vacation.

Sorry for breaking in upon that, and thank you for responding even so.

> That said, a replacement test box has been running tests since last Friday
> without hitting that case. Maybe it was the last death throes of
> that other machine before it gave up the ghost completely.
>
> Does sound like an awful coincidence though.

I'm still clinging to your 0.1% possibility that it was not the
intended kernel running.

Anyway, I'm not going to worry about it further, until we see another
hit - please do keep the VM_BUG_ON in your test kernel (i.e. resist
that temptation to race in from your vacation to apply today's patch!),
even though the right thing for 3.7 was to remove it.

Thanks,
Hugh

2012-11-15 07:39:59

by Jaegeuk Hanse

[permalink] [raw]
Subject: Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

On 11/14/2012 11:50 AM, Hugh Dickins wrote:
> On Wed, 14 Nov 2012, Jaegeuk Hanse wrote:
>> On 11/07/2012 07:48 AM, Hugh Dickins wrote:
>>> On Tue, 6 Nov 2012, Dave Jones wrote:
>>>> On Mon, Nov 05, 2012 at 05:32:41PM -0800, Hugh Dickins wrote:
>>>>
>>>> > - /* We already confirmed swap, and make no
>>>> allocation */
>>>> > - VM_BUG_ON(error);
>>>> > + /*
>>>> > + * We already confirmed swap under page lock,
>>>> and make
>>>> > + * no memory allocation here, so usually no
>>>> possibility
>>>> > + * of error; but free_swap_and_cache() only
>>>> trylocks a
>>>> > + * page, so it is just possible that the
>>>> entry has been
>>>> > + * truncated or holepunched since swap was
>>>> confirmed.
>>>> > + * shmem_undo_range() will have done some of
>>>> the
>>>> > + * unaccounting, now delete_from_swap_cache()
>>>> will do
>>>> > + * the rest (including
>>>> mem_cgroup_uncharge_swapcache).
>>>> > + * Reset swap.val? No, leave it so "failed"
>>>> goes back to
>>>> > + * "repeat": reading a hole and writing
>>>> should succeed.
>>>> > + */
>>>> > + if (error) {
>>>> > + VM_BUG_ON(error != -ENOENT);
>>>> > + delete_from_swap_cache(page);
>>>> > + }
>>>> > }
>>>>
>>>> I ran with this overnight,
>>> Thanks a lot...
>>>
>>>> and still hit the (new!) VM_BUG_ON
>>> ... but that's even more surprising than your original report.
>>>
>>>> Perhaps we should print out what 'error' was too ? I'll rebuild with
>>>> that..
>>> Thanks; though I thought the error was going to turn out too boring,
>>> and was preparing a debug patch for you to show the expected and found
>>> values too. But then got very puzzled...
>>>
>>>> ------------[ cut here ]------------
>>>> WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
>>>> Hardware name: 2012 Client Platform
>>>> Pid: 21798, comm: trinity-child4 Not tainted 3.7.0-rc4+ #54
>>> That's the very same line number as in your original report, despite
>>> the long comment which the patch adds. Are you sure that kernel was
>>> built with the patch in?
>>>
>>> I wouldn't usually question you, but I'm going mad trying to understand
>>> how the VM_BUG_ON(error != -ENOENT) fires. At the time I wrote that
>>> line, and when I was preparing the debug patch, I was thinking that an
>>> error from shmem_radix_tree_replace could also be -EEXIST, for when a
>>> different something rather than nothing is found [*]. But that's not
>>> the case, shmem_radix_tree_replace returns either 0 or -ENOENT.
>>>
>>> So if error != -ENOENT, that means shmem_add_to_page_cache went the
>>> radix_tree_insert route instead of the shmem_radix_tree_replace route;
>>> which means that its 'expected' is NULL, so swp_to_radix_entry(swap)
>>> is NULL; but swp_to_radix_entry() does an "| 2", so however corrupt
>>> the radix_tree might be, I do not understand the new VM_BUG_ON firing.
>>>
>>> Please tell me it was the wrong kernel!
>>> Hugh
>>>
>>> [*] But in thinking it over, I realize that if shmem_radix_tree_replace
>>> had returned -EEXIST for the "wrong something" case, I would have been
>>> wrong to BUG on that; because just as truncation could remove an entry,
>>> something else could immediately after instantiate a new page there.
>> Hi Hugh,
>>
>> As you said, swp_to_radix_entry() does an "| 2", so even if truncation could
>> remove an entry and something else could immediately after instantiate a new
>> page there, but the expected parameter will not be NULL, the result is
>> radix_tree_insert will not be called and shmem_add_to_page_cache will not
>> return -EEXIST, then why trigger BUG_ON ?
> Why insert the VM_BUG_ON? Because at the time I thought that it
> asserted something useful; but I was mistaken, as explained above.
>
> How can the VM_BUG_ON trigger (without stack corruption, or something
> of that kind)? I have no idea.
>
> We are in agreement: I now think that VM_BUG_ON is misleading and silly,
> and sent Andrew a further patch to remove it a just couple of hours ago.
>
> Originally I was waiting to hear further from Dave; but his test
> machine was giving trouble, and it occurred to me that, never mind
> whether he says he has hit it again, or he has not hit it again,
> the answer is the same: don't send that VM_BUG_ON upstream.
>
> Hugh

Thanks Hugh.

Another question. Why the function shmem_fallocate which you add to
kernel need call shmem_getpage?

Regards,
Jaegeuk

>
>> Regards,
>> Jaegeuk
>>
>>> So although I believe my VM_BUG_ON(error != -ENOENT) is safe, it's
>>> not saying what I had intended to say with it, and would have been
>>> wrong to say that anyway. It just looks stupid to me now, rather
>>> like inserting a VM_BUG_ON(false) - but that does become interesting
>>> when you report that you've hit it.

2012-11-15 19:56:10

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

Offtopic...

On Thu, 15 Nov 2012, Jaegeuk Hanse wrote:
>
> Another question. Why the function shmem_fallocate which you add to kernel
> need call shmem_getpage?

Because shmem_getpage(_gfp) is where shmem's
page lookup and allocation complexities are handled.

I assume the question behind your question is: why does shmem actually
allocate pages for its fallocate, instead of just reserving the space?

I did play with just reserving the space, with more special entries in
the radix_tree to note the reservations made. It should be doable for
the vm_enough_memory and sbinfo->used_blocks reservations.

What absolutely deterred me from taking that path was the mem_cgroup
case: shmem and swap and memcg are not easy to get working right together,
and nobody would thank me for complicating memcg just for shmem_fallocate.

By allocating pages, the pre-existing memcg code just works; if we used
reservations instead, we would have to track their memcg charges in some
additional new way. I see no justification for that complication.

Hugh

2012-11-16 00:40:23

by Jaegeuk Hanse

[permalink] [raw]
Subject: Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

On 11/16/2012 03:56 AM, Hugh Dickins wrote:
> Offtopic...
>
> On Thu, 15 Nov 2012, Jaegeuk Hanse wrote:
>> Another question. Why the function shmem_fallocate which you add to kernel
>> need call shmem_getpage?
> Because shmem_getpage(_gfp) is where shmem's
> page lookup and allocation complexities are handled.
>
> I assume the question behind your question is: why does shmem actually
> allocate pages for its fallocate, instead of just reserving the space?

Yeah, this is what I want to know.

>
> I did play with just reserving the space, with more special entries in
> the radix_tree to note the reservations made. It should be doable for
> the vm_enough_memory and sbinfo->used_blocks reservations.
>
> What absolutely deterred me from taking that path was the mem_cgroup
> case: shmem and swap and memcg are not easy to get working right together,
> and nobody would thank me for complicating memcg just for shmem_fallocate.
>
> By allocating pages, the pre-existing memcg code just works; if we used
> reservations instead, we would have to track their memcg charges in some
> additional new way. I see no justification for that complication.

Oh, I see, thanks Hugh. :-)

> Hugh

2012-11-16 09:34:31

by Jaegeuk Hanse

[permalink] [raw]
Subject: Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

On 11/16/2012 03:56 AM, Hugh Dickins wrote:
> Offtopic...
>
> On Thu, 15 Nov 2012, Jaegeuk Hanse wrote:
>> Another question. Why the function shmem_fallocate which you add to kernel
>> need call shmem_getpage?
> Because shmem_getpage(_gfp) is where shmem's
> page lookup and allocation complexities are handled.
>
> I assume the question behind your question is: why does shmem actually
> allocate pages for its fallocate, instead of just reserving the space?
>
> I did play with just reserving the space, with more special entries in
> the radix_tree to note the reservations made. It should be doable for
> the vm_enough_memory and sbinfo->used_blocks reservations.
>
> What absolutely deterred me from taking that path was the mem_cgroup
> case: shmem and swap and memcg are not easy to get working right together,
> and nobody would thank me for complicating memcg just for shmem_fallocate.
>
> By allocating pages, the pre-existing memcg code just works; if we used
> reservations instead, we would have to track their memcg charges in some
> additional new way. I see no justification for that complication.

Hi Hugh

Some questions about your shmem/tmpfs: misc and fallocate patchset.

- Since shmem_setattr can truncate tmpfs files, why need add another
similar codes in function shmem_fallocate? What's the trick?
- in tmpfs: support fallocate preallocation patch changelog:
"Christoph Hellwig: What for exactly? Please explain why
preallocating on tmpfs would make any sense.
Kay Sievers: To be able to safely use mmap(), regarding SIGBUS, on
files on the /dev/shm filesystem. The glibc fallback loop for -ENOSYS
[or -EOPNOTSUPP] on fallocate is just ugly."
Could shmem/tmpfs fallocate prevent one process truncate the file
which the second process mmap() and get SIGBUS when the second process
access mmap but out of current size of file?

Regards,
Jaegeuk

> Hugh

2012-11-17 04:48:49

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

Further offtopic..

On Fri, 16 Nov 2012, Jaegeuk Hanse wrote:
> Some questions about your shmem/tmpfs: misc and fallocate patchset.
>
> - Since shmem_setattr can truncate tmpfs files, why need add another similar
> codes in function shmem_fallocate? What's the trick?

I don't know if I understand you. In general, hole-punching is different
from truncation. Supporting the hole-punch mode of the fallocate system
call is different from supporting truncation. They're closely related,
and share code, but meet different specifications.

> - in tmpfs: support fallocate preallocation patch changelog:
> "Christoph Hellwig: What for exactly? Please explain why preallocating on
> tmpfs would make any sense.
> Kay Sievers: To be able to safely use mmap(), regarding SIGBUS, on files on
> the /dev/shm filesystem. The glibc fallback loop for -ENOSYS [or
> -EOPNOTSUPP] on fallocate is just ugly."
> Could shmem/tmpfs fallocate prevent one process truncate the file which the
> second process mmap() and get SIGBUS when the second process access mmap but
> out of current size of file?

Again, I don't know if I understand you. fallocate does not prevent
truncation or races or SIGBUS. I believe that Kay meant that without
using fallocate to allocate the memory in advance, systemd found it hard
to protect itself from the possibility of getting a SIGBUS, if access to
a shmem mapping happened to run out of memory/space in the middle.

I never grasped why writing the file in advance was not good enough:
fallocate happened to be what they hoped to use, and it was hard to
deny it, given that tmpfs already supported hole-punching, and was
about to convert to the fallocate interface for that.

Hugh

2012-11-18 00:58:10

by Jaegeuk Hanse

[permalink] [raw]
Subject: Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

On 11/17/2012 12:48 PM, Hugh Dickins wrote:
> Further offtopic..

Thanks for your explanation, Hugh. :-)

>
> On Fri, 16 Nov 2012, Jaegeuk Hanse wrote:
>> Some questions about your shmem/tmpfs: misc and fallocate patchset.
>>
>> - Since shmem_setattr can truncate tmpfs files, why need add another similar
>> codes in function shmem_fallocate? What's the trick?
> I don't know if I understand you. In general, hole-punching is different
> from truncation. Supporting the hole-punch mode of the fallocate system
> call is different from supporting truncation. They're closely related,
> and share code, but meet different specifications.

What's the different between shmem/tmpfs hole-punching and
truncate_setsize/truncate_pagecache?
Do you mean one is punch hole in the file and the other one is shrink or
extent the size of a file?

>> - in tmpfs: support fallocate preallocation patch changelog:
>> "Christoph Hellwig: What for exactly? Please explain why preallocating on
>> tmpfs would make any sense.
>> Kay Sievers: To be able to safely use mmap(), regarding SIGBUS, on files on
>> the /dev/shm filesystem. The glibc fallback loop for -ENOSYS [or
>> -EOPNOTSUPP] on fallocate is just ugly."
>> Could shmem/tmpfs fallocate prevent one process truncate the file which the
>> second process mmap() and get SIGBUS when the second process access mmap but
>> out of current size of file?
> Again, I don't know if I understand you. fallocate does not prevent
> truncation or races or SIGBUS. I believe that Kay meant that without
> using fallocate to allocate the memory in advance, systemd found it hard
> to protect itself from the possibility of getting a SIGBUS, if access to
> a shmem mapping happened to run out of memory/space in the middle.

IIUC, it will return VM_xxx_OOM instead of SIGBUS if run out of memory.
Then how can get SIGBUS in this scene?

Regards,
Jaegeuk

> I never grasped why writing the file in advance was not good enough:
> fallocate happened to be what they hoped to use, and it was hard to
> deny it, given that tmpfs already supported hole-punching, and was
> about to convert to the fallocate interface for that.
> Hugh

2012-11-18 01:48:23

by Jaegeuk Hanse

[permalink] [raw]
Subject: Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

On 11/17/2012 12:48 PM, Hugh Dickins wrote:
> Further offtopic..

Hi Hugh,

- I see you add this in vfs.txt:
+ fallocate: called by the VFS to preallocate blocks or punch a hole.
I want to know if it's necessary to add it to man page since users
still don't know fallocate can punch a hole from man fallocate.
- in function shmem_fallocate:
+ else if (shmem_falloc.nr_unswapped >
shmem_falloc.nr_falloced)
+ error = -ENOMEM;
If this changelog "shmem_fallocate() compare counts and give up once the
reactivated pages have started to coming back to writepage
(approximately: some zones would in fact recycle faster than others)."
describe why need this change? If the answer is yes, I have two questions.
1) how can guarantee it really don't need preallocation if just one or a
few pages always reactivated, in this scene, nr_unswapped maybe grow
bigger enough than shmem_falloc.nr_falloced
2) why return -ENOMEM, it's not really OOM, is it a trick or ...?

Regards,
Jaegeuk

>
> On Fri, 16 Nov 2012, Jaegeuk Hanse wrote:
>> Some questions about your shmem/tmpfs: misc and fallocate patchset.
>>
>> - Since shmem_setattr can truncate tmpfs files, why need add another similar
>> codes in function shmem_fallocate? What's the trick?
> I don't know if I understand you. In general, hole-punching is different
> from truncation. Supporting the hole-punch mode of the fallocate system
> call is different from supporting truncation. They're closely related,
> and share code, but meet different specifications.
>
>> - in tmpfs: support fallocate preallocation patch changelog:
>> "Christoph Hellwig: What for exactly? Please explain why preallocating on
>> tmpfs would make any sense.
>> Kay Sievers: To be able to safely use mmap(), regarding SIGBUS, on files on
>> the /dev/shm filesystem. The glibc fallback loop for -ENOSYS [or
>> -EOPNOTSUPP] on fallocate is just ugly."
>> Could shmem/tmpfs fallocate prevent one process truncate the file which the
>> second process mmap() and get SIGBUS when the second process access mmap but
>> out of current size of file?
> Again, I don't know if I understand you. fallocate does not prevent
> truncation or races or SIGBUS. I believe that Kay meant that without
> using fallocate to allocate the memory in advance, systemd found it hard
> to protect itself from the possibility of getting a SIGBUS, if access to
> a shmem mapping happened to run out of memory/space in the middle.
>
> I never grasped why writing the file in advance was not good enough:
> fallocate happened to be what they hoped to use, and it was hard to
> deny it, given that tmpfs already supported hole-punching, and was
> about to convert to the fallocate interface for that.
>
> Hugh