LinuxLists.cc - unused swap offset / bad page map.

Subject: Re: unused swap offset / bad page map.

Hello Dave

On Wed, Aug 7, 2013 at 1:51 PM, Dave Jones <[email protected]> wrote:
> Seen while fuzzing with lots of child processes.
>
> swap_free: Unused swap offset entry 001263f5
> BUG: Bad page map in process trinity-child29 pte:24c7ea00 pmd:09fec067
> addr:00007f9db958d000 vm_flags:00100073 anon_vma:ffff88022c004ba0 mapping: (null) index:f99
> Modules linked in: fuse ipt_ULOG snd_seq_dummy tun sctp scsi_transport_iscsi can_raw can_bcm rfcomm bnep nfnetlink hidp appletalk bluetooth rose can af_802154 phonet x25 af_rxrpc llc2 nfc rfkill af_key pppoe rds pppox ppp_generic slhc caif_socket caif irda crc_ccitt atm netrom ax25 ipx p8023 psnap p8022 llc snd_hda_codec_realtek pcspkr usb_debug snd_seq snd_seq_device snd_hda_intel snd_hda_codec snd_hwdep e1000e snd_pcm ptp pps_core snd_page_alloc snd_timer snd soundcore xfs libcrc32c
> CPU: 1 PID: 2624 Comm: trinity-child29 Not tainted 3.11.0-rc4+ #1
> 0000000000000000 ffff8801fd7ddc90 ffffffff81700f2c 00007f9db958d000
> ffff8801fd7ddcd8 ffffffff8117cba7 0000000024c7ea00 0000000000000f99
> 00007f9db9600000 ffff880009fecc68 0000000024c7ea00 ffff8801fd7dde00
> Call Trace:
> [<ffffffff81700f2c>] dump_stack+0x4e/0x82
> [<ffffffff8117cba7>] print_bad_pte+0x187/0x220
> [<ffffffff8117e415>] unmap_single_vma+0x535/0x890
> [<ffffffff8117f719>] unmap_vmas+0x49/0x90
> [<ffffffff81187ef1>] exit_mmap+0xc1/0x170
> [<ffffffff810510ef>] mmput+0x6f/0x100
> [<ffffffff81055818>] do_exit+0x288/0xcd0
> [<ffffffff810c1da5>] ? trace_hardirqs_on_caller+0x115/0x1e0
> [<ffffffff810c1e7d>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff810575dc>] do_group_exit+0x4c/0xc0
> [<ffffffff81057664>] SyS_exit_group+0x14/0x20
> [<ffffffff81713dd4>] tracesys+0xdd/0xe2
>
> There were a slew of these. same trace, different addr/anon_vma/index.
> mapping always null.
>
Would you please run again with the debug info added?
---
--- a/mm/swapfile.c Wed Aug 7 17:27:22 2013
+++ b/mm/swapfile.c Wed Aug 7 17:57:20 2013
@@ -509,6 +509,7 @@ static struct swap_info_struct *swap_inf
 {
 	struct swap_info_struct *p;
 	unsigned long offset, type;
+	int race = 0;

 	if (!entry.val)
 		goto out;
@@ -524,10 +525,17 @@ static struct swap_info_struct *swap_inf
 	if (!p->swap_map[offset])
 		goto bad_free;
 	spin_lock(&p->lock);
+	if (!p->swap_map[offset]) {
+		race = 1;
+		spin_unlock(&p->lock);
+		goto bad_free;
+	}
 	return p;

 bad_free:
 	printk(KERN_ERR "swap_free: %s%08lx\n", Unused_offset, entry.val);
+	if (race)
+		printk(KERN_ERR "but due to race\n");
 	goto out;
 bad_offset:
 	printk(KERN_ERR "swap_free: %s%08lx\n", Bad_offset, entry.val);
--

2013-08-07 15:30:45

Subject: Re: unused swap offset / bad page map.

On Wed, Aug 07, 2013 at 06:04:20PM +0800, Hillf Danton wrote:
> > There were a slew of these. same trace, different addr/anon_vma/index.
> > mapping always null.
> >
> Would you please run again with the debug info added?
> ---
> --- a/mm/swapfile.c Wed Aug 7 17:27:22 2013
> +++ b/mm/swapfile.c Wed Aug 7 17:57:20 2013
> @@ -509,6 +509,7 @@ static struct swap_info_struct *swap_inf
> {
> struct swap_info_struct *p;
> unsigned long offset, type;
> + int race = 0;
>
> if (!entry.val)
> goto out;
> @@ -524,10 +525,17 @@ static struct swap_info_struct *swap_inf
> if (!p->swap_map[offset])
> goto bad_free;
> spin_lock(&p->lock);
> + if (!p->swap_map[offset]) {
> + race = 1;
> + spin_unlock(&p->lock);
> + goto bad_free;
> + }
> return p;
>
> bad_free:
> printk(KERN_ERR "swap_free: %s%08lx\n", Unused_offset, entry.val);
> + if (race)
> + printk(KERN_ERR "but due to race\n");
> goto out;
> bad_offset:
> printk(KERN_ERR "swap_free: %s%08lx\n", Bad_offset, entry.val);
> --

printk didn't trigger.
This time around the oom killer was going off the same time.
I'm wondering if we have some allocations somewhere in the swap code that
don't handle failure correctly.

Dave

2013-08-07 15:54:28

Subject: Re: unused swap offset / bad page map.

void __lru_cache_add(struct page *page)
{
	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

	page_cache_get(page);
	if (!pagevec_space(pvec))
		__pagevec_lru_add(pvec);
	pagevec_add(pvec, page);
	put_cpu_var(lru_add_pvec);
}

I added a printk, and found that pagevec_add frequently returns 0. Is that ok ?

What happens to 'page' in this case ?

Dave
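
[Editorial note] A return value of 0 from pagevec_add() is probably not a lost page. As far as I can tell, the helper stores the page first and then returns the number of free slots that remain, so 0 only means the vector has just become full; the next caller sees no space and drains it before adding. The sketch below is a user-space model of that control flow, not kernel code: PAGEVEC_SIZE of 14 is what I believe kernels of this era use, and model_drain() and model_lru_cache_add() are hypothetical stand-ins for __pagevec_lru_add() and __lru_cache_add() without the per-CPU and refcount handling.

```c
#include <assert.h>
#include <stddef.h>

#define PAGEVEC_SIZE 14	/* assumed value; check include/linux/pagevec.h */

struct page;		/* opaque in this model */

struct pagevec {
	unsigned nr;
	struct page *pages[PAGEVEC_SIZE];
};

/* Number of free slots left in the vector. */
static unsigned pagevec_space(const struct pagevec *pvec)
{
	return PAGEVEC_SIZE - pvec->nr;
}

/* Store the page first, then report how many slots remain. */
static unsigned pagevec_add(struct pagevec *pvec, struct page *page)
{
	assert(pvec->nr < PAGEVEC_SIZE);	/* caller must ensure room */
	pvec->pages[pvec->nr++] = page;
	return pagevec_space(pvec);
}

/* Hypothetical stand-in for __pagevec_lru_add(): count and reset. */
static unsigned drained;

static void model_drain(struct pagevec *pvec)
{
	drained += pvec->nr;
	pvec->nr = 0;
}

/* Same control flow as the quoted __lru_cache_add(). */
static unsigned model_lru_cache_add(struct pagevec *pvec, struct page *page)
{
	if (!pagevec_space(pvec))
		model_drain(pvec);
	return pagevec_add(pvec, page);
}
```

In this model the 14th add returns 0 with the page safely stored, and the 15th add drains the 14 queued pages before storing the new one.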

2013-08-08 15:20:30

Subject: Re: unused swap offset / bad page map.

On Wed, Aug 7, 2013 at 11:30 PM, Dave Jones <[email protected]> wrote:
> printk didn't trigger.
>
Is a corrupted page table entry encountered, according to the
comment of swap_duplicate()?

--- a/mm/swapfile.c Wed Aug 7 17:27:22 2013
+++ b/mm/swapfile.c Thu Aug 8 23:12:30 2013
@@ -770,6 +770,7 @@ int free_swap_and_cache(swp_entry_t entr
 		unlock_page(page);
 		page_cache_release(page);
 	}
+	return 1;
 	return p != NULL;
 }

--

2013-08-08 15:36:27

Subject: Re: unused swap offset / bad page map.

2013-08-19 23:18:50

Subject: Re: unused swap offset / bad page map.

On Thu, Aug 08, 2013 at 11:20:28PM +0800, Hillf Danton wrote:
> On Wed, Aug 7, 2013 at 11:30 PM, Dave Jones <[email protected]> wrote:
> > printk didn't trigger.
> >
> Is a corrupted page table entry encountered, according to the
> comment of swap_duplicate()?
>
>
> --- a/mm/swapfile.c Wed Aug 7 17:27:22 2013
> +++ b/mm/swapfile.c Thu Aug 8 23:12:30 2013
> @@ -770,6 +770,7 @@ int free_swap_and_cache(swp_entry_t entr
> unlock_page(page);
> page_cache_release(page);
> }
> + return 1;
> return p != NULL;
> }
>
> --

[sorry for delay, been travelling]

With this applied, I no longer see the 'bad page' warning, but
I do still get a bunch of messages like..

[ 340.342436] swap_free: Unused swap offset entry 00003bb4
[ 340.952980] swap_free: Unused swap offset entry 0000298d
[ 340.953016] swap_free: Unused swap offset entry 00002996
[ 340.953048] swap_free: Unused swap offset entry 0000299d

btw, anyone have thoughts on a patch something like below ?
It's really annoying to debug stuff like this and have to walk
over to the machine and reboot it by hand after it wedges during swapoff.

Dave

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6cf2e60..bbb1192 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1587,6 +1587,10 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;

+	/* If we have hit memory corruption, we could hang during swapoff, so don't even try. */
+	if (test_taint(TAINT_BAD_PAGE))
+		return -EINVAL;
+
 	BUG_ON(!current->mm);

 	pathname = getname(specialfile);

2013-08-20 04:39:08

Subject: Re: unused swap offset / bad page map.

On Tue, Aug 20, 2013 at 7:18 AM, Dave Jones <[email protected]> wrote:
>
> btw, anyone have thoughts on a patch something like below ?

And another (sorry if the message is reformatted by the mail agent;
I spent an hour trying to get the agent back to the correct format but failed,
and thanks a lot for any how-to on sending plain text messages).

Hillf

--- a/mm/memory.c Wed Aug 7 16:29:34 2013
+++ b/mm/memory.c Tue Aug 20 11:13:06 2013
@@ -933,8 +933,10 @@ again:
 		if (progress >= 32) {
 			progress = 0;
 			if (need_resched() ||
-			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
+			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl)) {
+				BUG_ON(entry.val);
 				break;
+			}
 		}
 		if (pte_none(*src_pte)) {
 			progress++;
--

> It's really annoying to debug stuff like this and have to walk
> over to the machine and reboot it by hand after it wedges during swapoff.
>
> Dave
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 6cf2e60..bbb1192 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1587,6 +1587,10 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> if (!capable(CAP_SYS_ADMIN))
> return -EPERM;
>
> + /* If we have hit memory corruption, we could hang during swapoff, so don't even try. */
> + if (test_taint(TAINT_BAD_PAGE))
> + return -EINVAL;
> +
> BUG_ON(!current->mm);
>
> pathname = getname(specialfile);

2013-08-21 20:49:13

Subject: Re: unused swap offset / bad page map.

On Tue, Aug 20, 2013 at 12:39:05PM +0800, Hillf Danton wrote:
> On Tue, Aug 20, 2013 at 7:18 AM, Dave Jones <[email protected]> wrote:
>
> --- a/mm/memory.c Wed Aug 7 16:29:34 2013
> +++ b/mm/memory.c Tue Aug 20 11:13:06 2013
> @@ -933,8 +933,10 @@ again:
> if (progress >= 32) {
> progress = 0;
> if (need_resched() ||
> - spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> + spin_needbreak(src_ptl) || spin_needbreak(dst_ptl)) {
> + BUG_ON(entry.val);
> break;
> + }
> }
> if (pte_none(*src_pte)) {
> progress++;

didn't hit the bug_on, but got a bunch of

[ 424.077993] swap_free: Unused swap offset entry 000187d5
[ 439.377194] swap_free: Unused swap offset entry 000187e7
[ 441.998411] swap_free: Unused swap offset entry 000187ee
[ 446.956551] swap_free: Unused swap offset entry 0000245f

Dave

2013-08-22 00:35:12

Subject: Re: unused swap offset / bad page map.

2013-08-22 03:21:30

Subject: Re: unused swap offset / bad page map.

On Thu, Aug 22, 2013 at 4:49 AM, Dave Jones <[email protected]> wrote:
>
> didn't hit the bug_on, but got a bunch of
>
> [ 424.077993] swap_free: Unused swap offset entry 000187d5
> [ 439.377194] swap_free: Unused swap offset entry 000187e7
> [ 441.998411] swap_free: Unused swap offset entry 000187ee
> [ 446.956551] swap_free: Unused swap offset entry 0000245f
>
If page is reused, its swap entry is freed.

reuse_swap_page()
  delete_from_swap_cache()
    swapcache_free()
      count = swap_entry_free(p, entry, SWAP_HAS_CACHE);

If count drops to zero, then swap_free() gives warning.

--- a/mm/memory.c Wed Aug 7 16:29:34 2013
+++ b/mm/memory.c Thu Aug 22 10:44:32 2013
@@ -3123,6 +3123,7 @@ static int do_swap_page(struct mm_struct
 	/* It's better to call commit-charge after rmap is established */
 	mem_cgroup_commit_charge_swapin(page, ptr);

+	if (!exclusive)
 	swap_free(entry);
 	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
 		try_to_free_swap(page);
--

2013-08-23 03:21:40

Subject: Re: unused swap offset / bad page map.

On Thu, Aug 22, 2013 at 11:21:28AM +0800, Hillf Danton wrote:
> On Thu, Aug 22, 2013 at 4:49 AM, Dave Jones <[email protected]> wrote:
> >
> > didn't hit the bug_on, but got a bunch of
> >
> > [ 424.077993] swap_free: Unused swap offset entry 000187d5
> > [ 439.377194] swap_free: Unused swap offset entry 000187e7
> > [ 441.998411] swap_free: Unused swap offset entry 000187ee
> > [ 446.956551] swap_free: Unused swap offset entry 0000245f
> >
> If page is reused, its swap entry is freed.
>
> reuse_swap_page()
> delete_from_swap_cache()
> swapcache_free()
> count = swap_entry_free(p, entry, SWAP_HAS_CACHE);
>
> If count drops to zero, then swap_free() gives warning.
>
>
> --- a/mm/memory.c Wed Aug 7 16:29:34 2013
> +++ b/mm/memory.c Thu Aug 22 10:44:32 2013
> @@ -3123,6 +3123,7 @@ static int do_swap_page(struct mm_struct
> /* It's better to call commit-charge after rmap is established */
> mem_cgroup_commit_charge_swapin(page, ptr);
>
> + if (!exclusive)
> swap_free(entry);
> if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> try_to_free_swap(page);
> --

I still see the swap_free messages with this applied.

Dave

2013-08-23 03:53:55

Subject: Re: unused swap offset / bad page map.

On Fri, Aug 23, 2013 at 11:27:29AM +0800, Hillf Danton wrote:
> On Fri, Aug 23, 2013 at 11:21 AM, Dave Jones <[email protected]> wrote:
> >
> > I still see the swap_free messages with this applied.
> >
> Decremented?

It actually seems worse, seems I can trigger it even easier now, as if
there's a leak.

Dave

2013-08-26 03:45:55

Subject: Re: unused swap offset / bad page map.

On Fri, Aug 23, 2013 at 11:53 AM, Dave Jones <[email protected]> wrote:
>
> It actually seems worse, seems I can trigger it even easier now, as if
> there's a leak.
>
Can you please try the new fix for TLB flush?

commit 2b047252d087be7f2ba
Fix TLB gather virtual address range invalidation corner cases

2013-08-26 19:08:36

Subject: Re: unused swap offset / bad page map.

On Mon, Aug 26, 2013 at 11:45:53AM +0800, Hillf Danton wrote:
> On Fri, Aug 23, 2013 at 11:53 AM, Dave Jones <[email protected]> wrote:
> >
> > It actually seems worse, seems I can trigger it even easier now, as if
> > there's a leak.
> >
> Can you please try the new fix for TLB flush?
>
> commit 2b047252d087be7f2ba
> Fix TLB gather virtual address range invalidation corner cases

No luck.

[ 4588.541886] swap_free: Unused swap offset entry 00002d15
[ 4588.541952] BUG: Bad page map in process trinity-kid12 pte:005a2a80 pmd:22c01f067
[ 4588.541979] addr:00007f0e95fa8000 vm_flags:00100073 anon_vma:ffff880217665550 mapping: (null) index:1a42
[ 4588.542011] Modules linked in: snd_seq_dummy fuse hidp bnep scsi_transport_iscsi rfcomm ipt_ULOG can_bcm can_raw nfnetlink nfc caif_socket caif af_802154 phonet af_rxrpc bluetooth rfkill can llc2 pppoe pppox ppp_generic slhc irda crc_ccitt rds af_key rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 xfs libcrc32c snd_hda_codec_realtek snd_hda_intel e1000e snd_hda_codec snd_hwdep ptp snd_seq snd_seq_device snd_pcm usb_debug pps_core pcspkr snd_page_alloc snd_timer snd soundcore
[ 4588.542245] CPU: 2 PID: 25390 Comm: trinity-kid12 Not tainted 3.11.0-rc7+ #13
[ 4588.542321] 0000000000000000 ffff88021ba33c98 ffffffff816f9ddf 00007f0e95fa8000
[ 4588.542354] ffff88021ba33ce0 ffffffff81177047 00000000005a2a80 0000000000001a42
[ 4588.542386] 00007f0e96000000 ffff88022c01fd40 00000000005a2a80 ffff88021ba33e00
[ 4588.542418] Call Trace:
[ 4588.542435] [<ffffffff816f9ddf>] dump_stack+0x54/0x74
[ 4588.542457] [<ffffffff81177047>] print_bad_pte+0x187/0x220
[ 4588.542478] [<ffffffff81178874>] unmap_single_vma+0x524/0x850
[ 4588.542500] [<ffffffff81179ac9>] unmap_vmas+0x49/0x90
[ 4588.542521] [<ffffffff811822c5>] exit_mmap+0xc5/0x170
[ 4588.542542] [<ffffffff8104ffb7>] mmput+0x77/0x100
[ 4588.542562] [<ffffffff8105465d>] do_exit+0x28d/0xcd0
[ 4588.542583] [<ffffffff810c0085>] ? trace_hardirqs_on_caller+0x115/0x1e0
[ 4588.542607] [<ffffffff810c015d>] ? trace_hardirqs_on+0xd/0x10
[ 4588.542629] [<ffffffff8105643c>] do_group_exit+0x4c/0xc0
[ 4588.543534] [<ffffffff810564c4>] SyS_exit_group+0x14/0x20
[ 4588.544438] [<ffffffff8170d554>] tracesys+0xdd/0xe2

I can reproduce this pretty quickly by driving the system into swapping using
a few instances of 'trinity -C64' (this creates 64 threads)

I'm not sure how far back this bug goes, so I'll try some older kernels
and see if I can bisect it, because we don't seem to be getting closer
to figuring out what's actually happening..

Dave

2013-08-26 20:16:02

by Linus Torvalds

Subject: Re: unused swap offset / bad page map.

On Mon, Aug 26, 2013 at 12:08 PM, Dave Jones <[email protected]> wrote:
>
> [ 4588.541886] swap_free: Unused swap offset entry 00002d15
> [ 4588.541952] BUG: Bad page map in process trinity-kid12 pte:005a2a80 pmd:22c01f067
>
> I can reproduce this pretty quickly by driving the system into swapping using
> a few instances of 'trinity -C64' (this creates 64 threads)
>
> I'm not sure how far back this bug goes, so I'll try some older kernels
> and see if I can bisect it, because we don't seem to be getting closer
> to figuring out what's actually happening..

Bisecting would indeed be good. But I get the feeling that you'll need
to go back a *long* time, because the swap_map[] code hasn't changed
in ages.

I'm adding Hugh Dickins to the cc just in case he hasn't seen this on
linux-mm, because the swap_map[] code is complex as hell, and Hugh did
touch some of it last. The whole swap_map[] thing is complicated by:

- it's a single byte per swap entry
- it's not even a *structured* byte, but a single counter that has
several "fields" by hand
- it has a count in the low 6 bits, with a magic "bad" value (which
is also a magic "continuation" value if one of the high bits is set)
- it has two magic bits: HAS_CACHE and CONTINUED
- it has a _third_ magic value (SWAP_MAP_SHMEM) which is "CONTINUED+BAD"
- we increment this nasty pseudo-counter wildly hackily, and have
magic special case checks for the odd cases

and if we get any of the special cases wrong, we'll
increment/decrement it wrong, and we're screwed.
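
[Editorial note] The byte layout listed above can be sketched as a small decoder. This is a user-space illustration only: the constant values are what I recall from include/linux/swap.h in kernels of this era and should be checked against the actual tree, and classify() with its enum is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

/* Assumed values, as I recall them from include/linux/swap.h (3.x era). */
#define SWAP_HAS_CACHE	0x40	/* flag: page is also in the swap cache */
#define COUNT_CONTINUED	0x80	/* flag: count continues elsewhere */
#define SWAP_MAP_MAX	0x3e	/* largest count held in the low 6 bits */
#define SWAP_MAP_BAD	0x3f	/* magic "bad" value in the count field */
#define SWAP_MAP_SHMEM	0xbf	/* magic value for shmem/tmpfs owners */

/* The third magic value really is "CONTINUED+BAD". */
static_assert(SWAP_MAP_SHMEM == (COUNT_CONTINUED | SWAP_MAP_BAD),
	      "SWAP_MAP_SHMEM should equal COUNT_CONTINUED | SWAP_MAP_BAD");

/* Hypothetical classification, for illustration only. */
enum swap_ent_kind {
	SWAP_ENT_UNUSED,	/* whole byte is zero */
	SWAP_ENT_BAD,
	SWAP_ENT_SHMEM,
	SWAP_ENT_CONTINUED,
	SWAP_ENT_IN_USE
};

/* Count field with the cache flag masked off (may include COUNT_CONTINUED). */
static unsigned char swap_count(unsigned char ent)
{
	return ent & ~SWAP_HAS_CACHE;
}

static int swap_has_cache(unsigned char ent)
{
	return (ent & SWAP_HAS_CACHE) != 0;
}

/* Order matters: the magic values must be tested before the flag bits. */
static enum swap_ent_kind classify(unsigned char ent)
{
	unsigned char count = swap_count(ent);

	if (ent == 0)
		return SWAP_ENT_UNUSED;
	if (count == SWAP_MAP_SHMEM)
		return SWAP_ENT_SHMEM;
	if (count == SWAP_MAP_BAD)
		return SWAP_ENT_BAD;
	if (count & COUNT_CONTINUED)
		return SWAP_ENT_CONTINUED;
	return SWAP_ENT_IN_USE;
}
```

The "Unused swap offset entry" message in this thread corresponds to the SWAP_ENT_UNUSED case: a free was attempted on a slot whose byte is already zero.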

The *locking* looks pretty simple, though. It's a simple spinlock. We
do some optimistic tests outside the spinlock, but the actual
allocation and modification seem to all be inside the lock and
re-check any optimistic values afaik.

So I'm almost likely to think that we are more likely to have
something wrong in the messy magical special cases. I'm wondering if
we should get rid of the continuation crap, for example, and expand
the "one byte per swap page" to two bytes instead.

Hugh, I think you know this code best, because you added the last
special case (that SWAP_MAP_SHMEM value). Comments?

Linus

2013-08-26 20:18:51

Subject: Re: unused swap offset / bad page map.

On Mon, Aug 26, 2013 at 03:08:22PM -0400, Dave Jones wrote:
> On Mon, Aug 26, 2013 at 11:45:53AM +0800, Hillf Danton wrote:
> > On Fri, Aug 23, 2013 at 11:53 AM, Dave Jones <[email protected]> wrote:
> > >
> > > It actually seems worse, seems I can trigger it even easier now, as if
> > > there's a leak.
> > >
> > Can you please try the new fix for TLB flush?
> >
> > commit 2b047252d087be7f2ba
> > Fix TLB gather virtual address range invalidation corner cases
>
> No luck.

Hi Dave, could you please put your .config somewhere so I can try
to reproduce this problem? (I've tried trinity with -C64 but it didn't
trigger the issue.)

2013-08-26 20:37:16

Subject: Re: unused swap offset / bad page map.

On Tue, Aug 27, 2013 at 12:18:46AM +0400, Cyrill Gorcunov wrote:
> On Mon, Aug 26, 2013 at 03:08:22PM -0400, Dave Jones wrote:
> > On Mon, Aug 26, 2013 at 11:45:53AM +0800, Hillf Danton wrote:
> > > On Fri, Aug 23, 2013 at 11:53 AM, Dave Jones <[email protected]> wrote:
> > > >
> > > > It actually seems worse, seems I can trigger it even easier now, as if
> > > > there's a leak.
> > > >
> > > Can you please try the new fix for TLB flush?
> > >
> > > commit 2b047252d087be7f2ba
> > > Fix TLB gather virtual address range invalidation corner cases
> >
> > No luck.
>
> Hi Dave, could you please put your .config somewhere so i would try
> to repeat this problem? (i've tried trinity with -C64 but it didn't
> trigger the issue).

http://paste.fedoraproject.org/34944/77549285
machine I'm using has 8gb ram, 8gb swap, and 4 cores.

Try adding the -C64 to the invocation in scripts/test-multi.sh,
and perhaps up'ing the NR_PROCESSES variable there too.

Dave

2013-08-26 20:42:10

Subject: Re: unused swap offset / bad page map.

On Mon, Aug 26, 2013 at 04:37:02PM -0400, Dave Jones wrote:
>
> Try adding the -C64 to the invocation in scripts/test-multi.sh,
> and perhaps up'ing the NR_PROCESSES variable there too.

Thanks! I'll ping you if I manage to crash my instance.

2013-08-26 20:46:09

by Linus Torvalds

Subject: Re: unused swap offset / bad page map.

On Mon, Aug 26, 2013 at 1:15 PM, Linus Torvalds
<[email protected]> wrote:
>
> So I'm almost likely to think that we are more likely to have
> something wrong in the messy magical special cases.

Of course, the good news would be if it actually ends up being the
soft-dirty stuff, and bisection hits something recent.

So maybe I'm overly pessimistic. That messy swap_map[] code really
_is_ messy, but at the same time it should also be pretty well-tested.
I don't think it's been touched in years.

That said, google does find "swap_free: Unused swap offset entry"
reports from over the years. Most of them seem to be single-bit
errors, though (ie when the entry is 00000100 or similar I'm more
inclined to blame a bit error - in contrast your values look like
"real" swap entries).

Linus

2013-08-26 21:37:58

Subject: Re: unused swap offset / bad page map.

On Tue, Aug 27, 2013 at 12:42:03AM +0400, Cyrill Gorcunov wrote:
> On Mon, Aug 26, 2013 at 04:37:02PM -0400, Dave Jones wrote:
> >
> > Try adding the -C64 to the invocation in scripts/test-multi.sh,
> > and perhaps up'ing the NR_PROCESSES variable there too.
>
> Thanks! I'll ping you if I manage to crash my instance.

So trinity tainted the kernel, but definitely not in the place I'm interested in.

[ 320.904506] raw_sendmsg: trinity-child14 forgot to set AF_INET. Fix it!
[ 329.570812] ------------[ cut here ]------------
[ 329.571650] WARNING: CPU: 0 PID: 1982 at kernel/lockdep.c:3552 check_flags+0x18a/0x1c1()
[ 329.571650] DEBUG_LOCKS_WARN_ON(current->softirqs_enabled)
[ 329.571650] Modules linked in:
[ 329.571650] CPU: 0 PID: 1982 Comm: trinity-child4 Not tainted 3.11.0-rc6-dirty #386
[ 329.571650] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 329.571650] 0000000000000009 ffff88001ee03b10 ffffffff8157ac8a 0000000000000006
[ 329.571650] ffff88001ee03b60 ffff88001ee03b50 ffffffff81045bb2 ffffffff81583840
[ 329.571650] ffffffff81092620 ffff880002b48000 0000000000000046 ffffffff81a2f750
[ 329.571650] Call Trace:
[ 329.571650] <IRQ> [<ffffffff8157ac8a>] dump_stack+0x4f/0x84
[ 329.571650] [<ffffffff81045bb2>] warn_slowpath_common+0x81/0x9b
[ 329.571650] [<ffffffff81583840>] ? ftrace_call+0x5/0x2f
[ 329.571650] [<ffffffff81092620>] ? check_flags+0x18a/0x1c1
[ 329.571650] [<ffffffff81045c6f>] warn_slowpath_fmt+0x46/0x48
[ 329.571650] [<ffffffff81045c2e>] ? warn_slowpath_fmt+0x5/0x48
[ 329.571650] [<ffffffff81092620>] check_flags+0x18a/0x1c1
[ 329.571650] [<ffffffff81093595>] lock_is_held+0x30/0x5f
[ 329.571650] [<ffffffff810eb19e>] rcu_read_lock_held+0x36/0x38
[ 329.571650] [<ffffffff810f1b92>] perf_tp_event+0x92/0x220
[ 329.571650] [<ffffffff810f1d0e>] ? perf_tp_event+0x20e/0x220
[ 329.571650] [<ffffffff81049f6c>] ? __local_bh_enable+0x9a/0x9e
[ 329.571650] [<ffffffff810712f3>] ? get_parent_ip+0x3f/0x3f
[ 329.571650] [<ffffffff81049f6c>] ? __local_bh_enable+0x9a/0x9e
[ 329.571650] [<ffffffff810e3af1>] perf_ftrace_function_call+0xce/0xdc

...

(since my config is pretty similar to yours I tried to run trinity without
kernel recompilation. At first I loaded swap space with crap data

[root@ovz trinity]# free
             total       used       free     shared    buffers     cached
Mem:        493228     480188      13040          0       2912      12112
-/+ buffers/cache:     465164      28064
Swap:      2063356    1741304     322052

then ran it as

[root@ovz trinity]# ./trinity -C64 --dangerous)

I'll continue tomorrow with your config and test-multi.sh.

2013-08-26 21:42:57

Subject: Re: unused swap offset / bad page map.

On Tue, Aug 27, 2013 at 01:37:54AM +0400, Cyrill Gorcunov wrote:
> On Tue, Aug 27, 2013 at 12:42:03AM +0400, Cyrill Gorcunov wrote:
> > On Mon, Aug 26, 2013 at 04:37:02PM -0400, Dave Jones wrote:
> > >
> > > Try adding the -C64 to the invocation in scripts/test-multi.sh,
> > > and perhaps up'ing the NR_PROCESSES variable there too.
> >
> > Thanks! I'll ping you if I manage to crash my instance.
>
> So trinity tainted the kernel, but definitely not in the place I'm interested in.
>
> [ 320.904506] raw_sendmsg: trinity-child14 forgot to set AF_INET. Fix it!
> [ 329.570812] ------------[ cut here ]------------
> [ 329.571650] WARNING: CPU: 0 PID: 1982 at kernel/lockdep.c:3552 check_flags+0x18a/0x1c1()
> [ 329.571650] DEBUG_LOCKS_WARN_ON(current->softirqs_enabled)
> [ 329.571650] Modules linked in:
> [ 329.571650] CPU: 0 PID: 1982 Comm: trinity-child4 Not tainted 3.11.0-rc6-dirty #386
> [ 329.571650] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 329.571650] 0000000000000009 ffff88001ee03b10 ffffffff8157ac8a 0000000000000006
> [ 329.571650] ffff88001ee03b60 ffff88001ee03b50 ffffffff81045bb2 ffffffff81583840
> [ 329.571650] ffffffff81092620 ffff880002b48000 0000000000000046 ffffffff81a2f750
> [ 329.571650] Call Trace:
> [ 329.571650] <IRQ> [<ffffffff8157ac8a>] dump_stack+0x4f/0x84
> [ 329.571650] [<ffffffff81045bb2>] warn_slowpath_common+0x81/0x9b
> [ 329.571650] [<ffffffff81583840>] ? ftrace_call+0x5/0x2f
> [ 329.571650] [<ffffffff81092620>] ? check_flags+0x18a/0x1c1
> [ 329.571650] [<ffffffff81045c6f>] warn_slowpath_fmt+0x46/0x48
> [ 329.571650] [<ffffffff81045c2e>] ? warn_slowpath_fmt+0x5/0x48
> [ 329.571650] [<ffffffff81092620>] check_flags+0x18a/0x1c1
> [ 329.571650] [<ffffffff81093595>] lock_is_held+0x30/0x5f
> [ 329.571650] [<ffffffff810eb19e>] rcu_read_lock_held+0x36/0x38
> [ 329.571650] [<ffffffff810f1b92>] perf_tp_event+0x92/0x220
> [ 329.571650] [<ffffffff810f1d0e>] ? perf_tp_event+0x20e/0x220
> [ 329.571650] [<ffffffff81049f6c>] ? __local_bh_enable+0x9a/0x9e
> [ 329.571650] [<ffffffff810712f3>] ? get_parent_ip+0x3f/0x3f
> [ 329.571650] [<ffffffff81049f6c>] ? __local_bh_enable+0x9a/0x9e
> [ 329.571650] [<ffffffff810e3af1>] perf_ftrace_function_call+0xce/0xdc

when it rains, it pours..

> (since my config is pretty similar to yours I tried to run trinity without
> kernel recompilation. At first I loaded swap space with crap data
>
> [root@ovz trinity]# free
>              total       used       free     shared    buffers     cached
> Mem:        493228     480188      13040          0       2912      12112
> -/+ buffers/cache:     465164      28064
> Swap:      2063356    1741304     322052
>
> then ran it as
>
> [root@ovz trinity]# ./trinity -C64 --dangerous)

Yeah, for reproducing this bug, I'd stick to running it as a user, without --dangerous.
you might still hit a few fairly-easy to trigger warn-on/printks. I run with
this applied: http://paste.fedoraproject.org/34960/55323613/raw/ to make things
a little less noisy.

Dave

2013-08-26 21:49:46

Subject: Re: unused swap offset / bad page map.

On Mon, Aug 26, 2013 at 05:42:44PM -0400, Dave Jones wrote:
>
> Yeah, for reproducing this bug, I'd stick to running it as a user, without --dangerous.
> you might still hit a few fairly-easy to trigger warn-on/printks. I run with
> this applied: http://paste.fedoraproject.org/34960/55323613/raw/ to make things
> a little less noisy.

Ah, thanks, pulling it in. Btw, have you seen this problem earlier than -rc4 at all?

2013-08-26 21:59:29

Subject: Re: unused swap offset / bad page map.

On Tue, Aug 27, 2013 at 01:49:40AM +0400, Cyrill Gorcunov wrote:
> On Mon, Aug 26, 2013 at 05:42:44PM -0400, Dave Jones wrote:
> >
> > Yeah, for reproducing this bug, I'd stick to running it as a user, without --dangerous.
> > you might still hit a few fairly-easy to trigger warn-on/printks. I run with
> > this applied: http://paste.fedoraproject.org/34960/55323613/raw/ to make things
> > a little less noisy.
>
> Ah, thanks, pulling it in. Btw, have you seen this problem earlier than -rc4 at all?

I just hit it on 3.11-rc1. Couldn't reproduce on 3.10.

Dave

2013-08-26 22:09:07

by Hugh Dickins

Subject: Re: unused swap offset / bad page map.

On Mon, 26 Aug 2013, Linus Torvalds wrote:
> On Mon, Aug 26, 2013 at 1:15 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > So I'm almost likely to think that we are more likely to have
> > something wrong in the messy magical special cases.
>
> Of course, the good news would be if it actually ends up being the
> soft-dirty stuff, and bisection hits something recent.

I suspect so.

>
> So maybe I'm overly pessimistic. That messy swap_map[] code really
> _is_ messy, but at the same time it should also be pretty well-tested.
> I don't think it's been touched in years.

Blame me for the byte-instead-of-short continuation stuff.
But it's never yet shown any problem (okay, perhaps that's
because it's so rare to need any continuation anyway).

>
> That said, google does find "swap_free: Unused swap offset entry"
> reports from over the years. Most of them seem to be single-bit
> errors, though (ie when the entry is 00000100 or similar I'm more
> inclined to blame a bit error

Yes, historically they have usually represented either single-bit
errors, or corruption of page tables by other kernel data. The
swap subsystem discovers it, but it's rarely an error of swap.

So I don't care for Dave's suggestion much earlier in this thread,
that swapoff should fail with -EINVAL if there has been a bad page
taint: that doesn't necessarily interfere with swapoff at all.

And besides, swapoff is killable: yes, if counts go wrong, it
can cycle around endlessly, but it checks for signal_pending()
each time around the loop.

> - in contrast your values look like "real" swap entries).

Indeed they do.

I just did a quick diff of 3.11-rc7/mm against 3.10, and here's
a line in mremap which worries me. That set_pte_at() is operating
on anything that isn't pte_none(), so the pte_mksoft_dirty() looks
prone to corrupt a swap entry.

I've not tried matching up bits with Dave's reports, and just going
into a meeting now, but this patch looks worth a try: probably Cyrill
can improve it meanwhile to what he actually wants there (I'm
surprised anything special is needed for just moving a pte).

Hugh

--- 3.11-rc7/mm/mremap.c 2013-07-14 17:10:16.640003652 -0700
+++ linux/mm/mremap.c 2013-08-26 14:46:14.460027627 -0700
@@ -126,7 +126,7 @@ static void move_ptes(struct vm_area_str
 			continue;
 		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
-		set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte));
+		set_pte_at(mm, new_addr, new_pte, pte);
 	}

 	arch_leave_lazy_mmu_mode();

2013-08-26 22:28:47

Subject: Re: unused swap offset / bad page map.

On Mon, Aug 26, 2013 at 03:08:45PM -0700, Hugh Dickins wrote:

> > That said, google does find "swap_free: Unused swap offset entry"
> > reports from over the years. Most of them seem to be single-bit
> > errors, though (ie when the entry is 00000100 or similar I'm more
> > inclined to blame a bit error
>
> Yes, historically they have usually represented either single-bit
> errors, or corruption of page tables by other kernel data. The
> swap subsystem discovers it, but it's rarely an error of swap.

Just to rule out bad hardware, I've seen this on two systems
(admittedly the exact same spec, but still..)

> So I don't care for Dave's suggestion much earlier in this thread,
> that swapoff should fail with -EINVAL if there has been a bad page
> taint: that doesn't necessarily interfere with swapoff at all.
>
> And besides, swapoff is killable: yes, if counts go wrong, it
> can cycle around endlessly, but it checks for signal_pending()
> each time around the loop.

It might be killable, but if I've done /sbin/reboot, and the
kernel dies in sys_swapoff because of the corruption, I won't
get a chance to kill it, because at that point the shutdown process
has killed my shell, sshd, and just about everything else.
It means a grumpy walk to the other side of the house to prod a
reset button. So yeah, it might not be a mergeable thing, but
at least while bisecting it's pretty much a must-have.

> I just did a quick diff of 3.11-rc7/mm against 3.10, and here's
> a line in mremap which worries me. That set_pte_at() is operating
> on anything that isn't pte_none(), so the pte_mksoft_dirty() looks
> prone to corrupt a swap entry.
>
> I've not tried matching up bits with Dave's reports, and just going
> into a meeting now, but this patch looks worth a try: probably Cyrill
> can improve it meanwhile to what he actually wants there (I'm
> surprised anything special is needed for just moving a pte).
>
> Hugh
>
> --- 3.11-rc7/mm/mremap.c 2013-07-14 17:10:16.640003652 -0700
> +++ linux/mm/mremap.c 2013-08-26 14:46:14.460027627 -0700
> @@ -126,7 +126,7 @@ static void move_ptes(struct vm_area_str
> continue;
> pte = ptep_get_and_clear(mm, old_addr, old_pte);
> pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
> - set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte));
> + set_pte_at(mm, new_addr, new_pte, pte);
> }

I'll give this a shot once I'm done with the bisect.

Dave

2013-08-26 23:15:03

by Linus Torvalds

Subject: Re: unused swap offset / bad page map.

On Mon, Aug 26, 2013 at 3:08 PM, Hugh Dickins <[email protected]> wrote:
>
> I just did a quick diff of 3.11-rc7/mm against 3.10, and here's
> a line in mremap which worries me. That set_pte_at() is operating
> on anything that isn't pte_none(), so the pte_mksoft_dirty() looks
> prone to corrupt a swap entry.

Uhhuh. I think you hit the nail on the head here.

I checked all the pte_swp_*soft_dirty() users (they should be used on
swp entries), because that came up in another thread. But you're
right, the non-swp ones only work on present pte entries (or on
file-offset entries, I guess), and at least that mremap() case seems
bogus.

I'm not seeing the point of marking the thing soft-dirty at all,
although I guess it's "dirty" in the sense that it changed the
contents at that virtual address. But for that code to work, it would
have to have the same bit for swap entries as for present pages (and
for file mapping entries), and that's not true. They are two different
bits (_PAGE_SOFT_DIRTY is bit #11 vs _PAGE_SWP_SOFT_DIRTY is bit #7).

Ugh. Cyrill, this is a mess.

Linus
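
A rough userspace check of the bit arithmetic discussed above, using the two pte values from the reports in this thread. The bit numbers 11 and 7 come from the message above. The offset shift of 9 is an assumption inferred from the reports themselves (pte >> 9 matches the "Unused swap offset entry" value in both), not taken from the kernel headers, and every helper name here is made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bit numbers as quoted in the thread. */
#define SOFT_DIRTY_BIT      11  /* _PAGE_SOFT_DIRTY: present ptes */
#define SWP_SOFT_DIRTY_BIT   7  /* _PAGE_SWP_SOFT_DIRTY: swap ptes */

/*
 * ASSUMPTION: the swap offset starts at bit 9 of the pte. This is
 * inferred from the reported values, not read from kernel source.
 */
#define SWP_OFFSET_SHIFT     9

/* Hypothetical helper: extract the swap offset from a raw pte value. */
static uint64_t swp_offset_of(uint64_t pte)
{
	return pte >> SWP_OFFSET_SHIFT;
}

/* Hypothetical helper: test a single bit of a raw pte value. */
static int pte_has_bit(uint64_t pte, unsigned int bit)
{
	assert(bit < 64);
	return (int)((pte >> bit) & 1);
}

/* Hypothetical helper: undo a stray bit-11 soft-dirty marking. */
static uint64_t clear_stray_soft_dirty(uint64_t pte)
{
	return pte & ~(UINT64_C(1) << SOFT_DIRTY_BIT);
}

/* Checks the two reported ptes against the reported swap entries. */
static void soft_dirty_self_check(void)
{
	const uint64_t first  = UINT64_C(0x24c7ea00); /* trinity-child29 */
	const uint64_t second = UINT64_C(0x005a2a80); /* trinity-kid12 */

	/* The shifted pte matches the "Unused swap offset entry" value. */
	assert(swp_offset_of(first)  == UINT64_C(0x1263f5));
	assert(swp_offset_of(second) == UINT64_C(0x2d15));

	/* Both ptes carry bit 11, which falls inside the offset field. */
	assert(pte_has_bit(first, SOFT_DIRTY_BIT));
	assert(pte_has_bit(second, SOFT_DIRTY_BIT));

	/* Clearing bit 11 lowers the decoded offset by 4 (offset bit 2). */
	assert(swp_offset_of(clear_stray_soft_dirty(first)) ==
	       UINT64_C(0x1263f1));
}
```

Under that assumption, bit 11 of the pte is bit 2 of the swap offset, so pte_mksoft_dirty() applied to a swap pte would turn offset 0x1263f1 into the reported 0x1263f5. Whether those were the real original offsets cannot be confirmed from the thread.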

2013-08-27 05:44:34

Subject: Re: unused swap offset / bad page map.

On Mon, Aug 26, 2013 at 04:15:00PM -0700, Linus Torvalds wrote:
> On Mon, Aug 26, 2013 at 3:08 PM, Hugh Dickins <[email protected]> wrote:
> >
> > I just did a quick diff of 3.11-rc7/mm against 3.10, and here's
> > a line in mremap which worries me. That set_pte_at() is operating
> > on anything that isn't pte_none(), so the pte_mksoft_dirty() looks
> > prone to corrupt a swap entry.
>
> Uhhuh. I think you hit the nail on the head here.
>
> I checked all the pte_swp_*soft_dirty() users (they should be used on
> swp entries), because that came up in another thread. But you're
> right, the non-swp ones only work on present pte entries (or on
> file-offset entries, I guess), and at least that mremap() case seems
> bogus.

Oh my :( Indeed it sets _PAGE_SOFT_DIRTY unconditionally, sigh. This
nit comes from the earlier soft-dirty commit. Let me check all the other
places where we set the soft-dirty bit (Pavel CC'ed).

> I'm not seeing the point of marking the thing soft-dirty at all,
> although I guess it's "dirty" in the sense that it changed the
> contents at that virtual address. But for that code to work, it would
> have to have the same bit for swap entries as for present pages (and
> for file mapping entries), and that's not true. They are two different
> bits (_PAGE_SOFT_DIRTY is bit #11 vs _PAGE_SWP_SOFT_DIRTY is bit #7).
>
> Ugh. Cyrill, this is a mess.

Linus, I simply had no other place in the pte to carry soft-dirty status
when the pte is encoded in swap format, so it was an unpleasant but
necessary decision. That's why accesses to these bits are wrapped in their
own macros with a 'swp' prefix, so a reader can easily grep for them.

2013-08-27 08:37:24

Subject: Re: unused swap offset / bad page map.

On Mon, Aug 26, 2013 at 06:28:33PM -0400, Dave Jones wrote:
> >
> > I've not tried matching up bits with Dave's reports, and just going
> > into a meeting now, but this patch looks worth a try: probably Cyrill
> > can improve it meanwhile to what he actually wants there (I'm
> > surprised anything special is needed for just moving a pte).
> >
> > Hugh
> >
> > --- 3.11-rc7/mm/mremap.c 2013-07-14 17:10:16.640003652 -0700
> > +++ linux/mm/mremap.c 2013-08-26 14:46:14.460027627 -0700
> > @@ -126,7 +126,7 @@ static void move_ptes(struct vm_area_str
> > continue;
> > pte = ptep_get_and_clear(mm, old_addr, old_pte);
> > pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
> > - set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte));
> > + set_pte_at(mm, new_addr, new_pte, pte);
> > }
>
> I'll give this a shot once I'm done with the bisect.

I managed to trigger the issue as well. The patch below fixes it.
Dave, could you please give it a shot once time permits?

Pavel, I kept the 'make it dirty on move' logic, but I have some doubts
about it: wouldn't plain pte copying (as in Hugh's patch) work for us?
---
From: Cyrill Gorcunov <[email protected]>
Subject: [PATCH] mm: move_ptes -- Set soft dirty bit depending on pte type

Dave reported corrupted swap entries

| [ 4588.541886] swap_free: Unused swap offset entry 00002d15
| [ 4588.541952] BUG: Bad page map in process trinity-kid12 pte:005a2a80 pmd:22c01f067

and Hugh pointed out that in move_ptes the _PAGE_SOFT_DIRTY bit
is set regardless of the type of entry the pte holds. The
trick here is that when we carry soft-dirty status in swap
entries we must use _PAGE_SWP_SOFT_DIRTY instead, because
this is the only place in the pte which can be used for our
own needs without intersecting with the bits owned by the
swap entry type/offset.

Reported-by: Dave Jones <[email protected]>
Signed-off-by: Cyrill Gorcunov <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Andrew Morton <[email protected]>
---
mm/mremap.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)

Index: linux-2.6.git/mm/mremap.c
===================================================================
--- linux-2.6.git.orig/mm/mremap.c
+++ linux-2.6.git/mm/mremap.c
@@ -15,6 +15,7 @@
#include <linux/swap.h>
#include <linux/capability.h>
#include <linux/fs.h>
+#include <linux/swapops.h>
#include <linux/highmem.h>
#include <linux/security.h>
#include <linux/syscalls.h>
@@ -69,6 +70,23 @@ static pmd_t *alloc_new_pmd(struct mm_st
return pmd;
}

+static pte_t move_soft_dirty_pte(pte_t pte)
+{
+	/*
+	 * Set soft dirty bit so we can notice
+	 * in userspace the ptes were moved.
+	 */
+#ifdef CONFIG_MEM_SOFT_DIRTY
+	if (pte_present(pte))
+		pte = pte_mksoft_dirty(pte);
+	else if (is_swap_pte(pte))
+		pte = pte_swp_mksoft_dirty(pte);
+	else if (pte_file(pte))
+		pte = pte_file_mksoft_dirty(pte);
+#endif
+	return pte;
+}
+
static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
unsigned long old_addr, unsigned long old_end,
struct vm_area_struct *new_vma, pmd_t *new_pmd,
@@ -126,7 +144,8 @@ static void move_ptes(struct vm_area_str
continue;
pte = ptep_get_and_clear(mm, old_addr, old_pte);
pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
- set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte));
+ pte = move_soft_dirty_pte(pte);
+ set_pte_at(mm, new_addr, new_pte, pte);
}

arch_leave_lazy_mmu_mode();
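
The type-dependent choice made by the patch above can be modelled in userspace roughly as follows. This is only a sketch, not the kernel code: model_pte_t and every model_* name are hypothetical, the present check is reduced to bit 0, every non-empty non-present value is treated as a swap entry, and the file-pte branch is left out. The bit numbers 11 and 7 are the ones quoted earlier in the thread, and the offset shift of 9 is an assumption inferred from the reported pte values.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t model_pte_t;	/* hypothetical stand-in for pte_t */

#define MODEL_PRESENT		(UINT64_C(1) << 0)  /* simplified present check */
#define MODEL_SOFT_DIRTY	(UINT64_C(1) << 11) /* _PAGE_SOFT_DIRTY */
#define MODEL_SWP_SOFT_DIRTY	(UINT64_C(1) << 7)  /* _PAGE_SWP_SOFT_DIRTY */
#define MODEL_SWP_OFFSET_SHIFT	9		    /* ASSUMPTION, see lead-in */

/*
 * Pick the soft-dirty bit that matches the kind of entry, so that a
 * swap entry never gets a bit set inside its offset field.
 */
static model_pte_t model_move_soft_dirty(model_pte_t pte)
{
	if (pte == 0)			/* empty slot: nothing to mark */
		return pte;
	if (pte & MODEL_PRESENT)	/* mapped page */
		return pte | MODEL_SOFT_DIRTY;
	return pte | MODEL_SWP_SOFT_DIRTY; /* modelled as a swap entry */
}

/* Small self-check showing the swap offset survives the marking. */
static void model_self_check(void)
{
	const model_pte_t swap_pte = UINT64_C(0x24c7e200); /* made-up value */
	const model_pte_t marked = model_move_soft_dirty(swap_pte);

	assert(marked & MODEL_SWP_SOFT_DIRTY);
	assert(!(marked & MODEL_SOFT_DIRTY));
	assert((marked >> MODEL_SWP_OFFSET_SHIFT) ==
	       (swap_pte >> MODEL_SWP_OFFSET_SHIFT));
}
```

Calling model_self_check() from a test harness exercises the swap case. A present value such as 0x1001 comes back as 0x1801, with bit 11 set instead of bit 7.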

2013-08-27 16:24:48

Subject: Re: unused swap offset / bad page map.

On Tue, Aug 27, 2013 at 12:37:18PM +0400, Cyrill Gorcunov wrote:
> On Mon, Aug 26, 2013 at 06:28:33PM -0400, Dave Jones wrote:
> > >
> > > I've not tried matching up bits with Dave's reports, and just going
> > > into a meeting now, but this patch looks worth a try: probably Cyrill
> > > can improve it meanwhile to what he actually wants there (I'm
> > > surprised anything special is needed for just moving a pte).
> > >
> > > Hugh
> > >
> > > --- 3.11-rc7/mm/mremap.c 2013-07-14 17:10:16.640003652 -0700
> > > +++ linux/mm/mremap.c 2013-08-26 14:46:14.460027627 -0700
> > > @@ -126,7 +126,7 @@ static void move_ptes(struct vm_area_str
> > > continue;
> > > pte = ptep_get_and_clear(mm, old_addr, old_pte);
> > > pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
> > > - set_pte_at(mm, new_addr, new_pte, pte_mksoft_dirty(pte));
> > > + set_pte_at(mm, new_addr, new_pte, pte);
> > > }
> >
> > I'll give this a shot once I'm done with the bisect.
>
> I managed to trigger the issue as well. The patch below fixes it.
> Dave, could you please give it a shot once time permits?

Seems to do the trick.

Tested-by: Dave Jones <[email protected]>

Dave

2013-08-27 16:32:39