2015-11-02 13:02:59

by Kirill A. Shutemov

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

On Fri, Oct 30, 2015 at 04:03:50PM +0900, Minchan Kim wrote:
> On Thu, Oct 29, 2015 at 11:52:06AM +0200, Kirill A. Shutemov wrote:
> > On Thu, Oct 29, 2015 at 04:58:29PM +0900, Minchan Kim wrote:
> > > On Thu, Oct 29, 2015 at 02:25:24AM +0200, Kirill A. Shutemov wrote:
> > > > On Thu, Oct 22, 2015 at 06:00:51PM +0900, Minchan Kim wrote:
> > > > > On Thu, Oct 22, 2015 at 10:21:36AM +0900, Minchan Kim wrote:
> > > > > > Hello Hugh,
> > > > > >
> > > > > > On Wed, Oct 21, 2015 at 05:59:59PM -0700, Hugh Dickins wrote:
> > > > > > > On Thu, 22 Oct 2015, Minchan Kim wrote:
> > > > > > > >
> > > > > > > > I added the code to check it and queued the test again, but this time
> > > > > > > > I hit another oops; the symptom is related to anon_vma, too.
> > > > > > > > (The kernel is based on recent mmotm + unconditional mkdirty as a bug fix.)
> > > > > > > > It seems page_get_anon_vma() returned NULL because the page was not
> > > > > > > > page_mapped() at that time, but the second page_mapped() check right
> > > > > > > > before try_to_unmap() seems to have been true.
> > > > > > > >
> > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > page:ffffea0001cfbfc0 count:3 mapcount:1 mapping:ffff88007f1b5f51 index:0x600000aff
> > > > > > > > flags: 0x4000000000048019(locked|uptodate|dirty|swapcache|swapbacked)
> > > > > > > > page dumped because: VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma)
> > > > > > >
> > > > > > > That's interesting, that's one I added in my page migration series.
> > > > > > > Let me think on it, but it could well relate to the one you got before.
> > > > > >
> > > > > > I will roll back to mm/madv_free-v4.3-rc5-mmotm-2015-10-15-15-20
> > > > > > instead of next-20151021 to remove noise from your migration cleanup
> > > > > > series and will test it again.
> > > > > > If it is fixed, I will test again with your migration patchset, then.
> > > > >
> > > > > I tested mmotm-2015-10-15-15-20 for a long time with the test program I
> > > > > attached. So Hugh's migration patchset is not in there.
> > > > > At Kirill's request, I also added the debug code below to all test kernels.
> > > >
> > > > It took a long time (and a lot of printk()), but I think I have finally
> > > > tracked it down.
> > > >
> > > > The patch below seems to fix the issue for me. It's not yet properly
> > > > tested, but it looks like it works.
> > > >
> > > > The problem was my wrong assumption about how migration works: I thought
> > > > the kernel would wait for migration to finish before tearing down the
> > > > mapping.
> > > >
> > > > But it turns out that's not true.
> > > >
> > > > As a result, if zap_pte_range() races with split_huge_page(), we can end up
> > > > with a page which is not mapped anymore but has _count and _mapcount
> > > > elevated. The page is on the LRU too, so it's still reachable by vmscan and
> > > > by pfn scanners (Sasha showed a few similar traces from compaction too).
> > > > It's likely that page->mapping in this case points to a freed anon_vma.
> > > >
> > > > BOOM!
> > > >
> > > > The patch modifies the freeze/unfreeze_page() code to match the logic of
> > > > normal migration entries: on setup we remove the page from rmap and drop
> > > > the pin; on removal we take the pin back and put the page back on rmap.
> > > > This way, even if the migration entry is removed under us, we don't
> > > > corrupt the page's state.
> > > >
> > > > Please test.
> > > >
> > >
> > > kernel: On mmotm-2015-10-15-15-20 + pte_mkdirty patch + your new patch, I ran
> > > the test I sent to you (i.e., oops.c + memcg_test.sh):
> > >
> > > page:ffffea00016a0000 count:3 mapcount:0 mapping:ffff88007f49d001 index:0x600001800 compound_mapcount: 0
> > > flags: 0x4000000000044009(locked|uptodate|head|swapbacked)
> > > page dumped because: VM_BUG_ON_PAGE(!page_mapcount(page))
> > > page->mem_cgroup:ffff88007f613c00
> >
> > Ignore my previous answer. Still sleeping.
> >
> > I think the right way to fix it is something like:
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 35643176bc15..f2d46792a554 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1173,20 +1173,12 @@ void do_page_add_anon_rmap(struct page *page,
> > bool compound = flags & RMAP_COMPOUND;
> > bool first;
> >
> > - if (PageTransCompound(page)) {
> > + if (PageTransCompound(page) && compound) {
> > + atomic_t *mapcount;
> > VM_BUG_ON_PAGE(!PageLocked(page), page);
> > - if (compound) {
> > - atomic_t *mapcount;
> > -
> > - VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> > - mapcount = compound_mapcount_ptr(page);
> > - first = atomic_inc_and_test(mapcount);
> > - } else {
> > - /* Anon THP always mapped first with PMD */
> > - first = 0;
> > - VM_BUG_ON_PAGE(!page_mapcount(page), page);
> > - atomic_inc(&page->_mapcount);
> > - }
> > + VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> > + mapcount = compound_mapcount_ptr(page);
> > + first = atomic_inc_and_test(mapcount);
> > } else {
> > VM_BUG_ON_PAGE(compound, page);
> > first = atomic_inc_and_test(&page->_mapcount);
> > --
>
> kernel: On mmotm-2015-10-15-15-20 + pte_mkdirty patch + freeze/unfreeze patch + above patch,
>
> Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512
> Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
>
> <SNIP>
>
> Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> BUG: Bad rss-counter state mm:ffff880046980700 idx:1 val:511
> BUG: Bad rss-counter state mm:ffff880046980700 idx:2 val:1

Hm. I was not able to trigger this, and I don't see anything obvious that
could lead to this kind of mismatch :-/
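
A side observation on the numbers, offered as a sketch rather than a
diagnosis: both reports add up to 512 pages (512 + 0 for the first mm,
511 + 1 for the second). Assuming x86-64 with 4 KiB base pages and 2 MiB huge
pages, that is exactly one huge page's worth of accounting. The helper name
below is hypothetical, the macros are local stand-ins for the kernel's, and
reading idx:1 and idx:2 as the anonymous-page and swap-entry counters is an
assumption about this kernel's counter layout.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed x86-64 geometry: 4 KiB base pages, 2 MiB PMD-mapped huge pages. */
#define PAGE_SHIFT      12
#define HPAGE_PMD_SHIFT 21
#define HPAGE_PMD_NR    (1L << (HPAGE_PMD_SHIFT - PAGE_SHIFT))

static_assert(HPAGE_PMD_NR == 512, "one huge page spans 512 base pages");

/*
 * Hypothetical helper: true when the leftover counters reported at mm
 * teardown sum to a non-zero whole number of huge pages.
 * anon_pages   - value reported for idx:1 (assumed: anonymous pages)
 * swap_entries - value reported for idx:2 (assumed: swap entries)
 */
static bool rss_leak_is_whole_thp(long anon_pages, long swap_entries)
{
	long total = anon_pages + swap_entries;

	return total != 0 && total % HPAGE_PMD_NR == 0;
}
```

Both reported states (512 and 0, 511 and 1) satisfy this check, which would be
consistent with the accounting for one huge page being skipped, although the
numbers alone do not prove that.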

I found one more bug: the clearing of PageTail can become visible to other
CPUs before the updated page->flags of that page.

I don't think this bug is connected to what you've reported, but it is worth
testing.
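
The ordering problem can be sketched in userspace with C11 atomics. This is
only an analogy under stated assumptions: struct fake_page and both functions
are hypothetical, and a release store / acquire load pair stands in for the
kernel's smp_wmb() on the writer side and a matching read barrier on the
reader side. It is not the kernel code itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the two fields that matter. */
struct fake_page {
	unsigned long flags;         /* plain data, like page->flags */
	atomic_ulong compound_head;  /* non-zero while the page is a tail */
};

/* Writer: finish updating flags, then make the page non-compound. */
static void split_tail(struct fake_page *p, unsigned long new_flags)
{
	assert(p != NULL);
	p->flags = new_flags;
	/*
	 * The release store orders the flags write before the "no longer a
	 * tail" store, which is the job smp_wmb() does before
	 * clear_compound_head() in the patch.
	 */
	atomic_store_explicit(&p->compound_head, 0, memory_order_release);
}

/* Reader: trust flags only once the page is observed as non-tail. */
static bool read_flags_if_split(struct fake_page *p, unsigned long *flags)
{
	assert(p != NULL && flags != NULL);
	if (atomic_load_explicit(&p->compound_head, memory_order_acquire) != 0)
		return false;	/* still a tail page, flags not final yet */
	*flags = p->flags;
	return true;
}
```

Without the ordering on the writer side, a reader on another CPU could see
the page as non-compound while still reading the stale flags.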

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5e0fe82a0fae..12bd8c5a4409 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2934,6 +2934,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,

smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
+
+ if (freeze) {
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ page_remove_rmap(page + i, false);
+ put_page(page + i);
+ }
+ }
}

void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
@@ -3079,6 +3086,8 @@ static void freeze_page_vma(struct vm_area_struct *vma, struct page *page,
if (pte_soft_dirty(entry))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
set_pte_at(vma->vm_mm, address, pte + i, swp_pte);
+ page_remove_rmap(page, false);
+ put_page(page);
}
pte_unmap_unlock(pte, ptl);
}
@@ -3117,8 +3126,6 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
return;
pte = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
for (i = 0; i < HPAGE_PMD_NR; i++, address += PAGE_SIZE, page++) {
- if (!page_mapped(page))
- continue;
if (!is_swap_pte(pte[i]))
continue;

@@ -3128,6 +3135,9 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
if (migration_entry_to_page(swp_entry) != page)
continue;

+ get_page(page);
+ page_add_anon_rmap(page, vma, address, false);
+
entry = pte_mkold(mk_pte(page, vma->vm_page_prot));
entry = pte_mkdirty(entry);
if (is_write_migration_entry(swp_entry))
@@ -3181,8 +3191,6 @@ static int __split_huge_page_tail(struct page *head, int tail,
*/
atomic_add(mapcount + 1, &page_tail->_count);

- /* after clearing PageTail the gup refcount can be released */
- smp_mb__after_atomic();

page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
page_tail->flags |= (head->flags &
@@ -3195,6 +3203,12 @@ static int __split_huge_page_tail(struct page *head, int tail,
(1L << PG_unevictable)));
page_tail->flags |= (1L << PG_dirty);

+ /*
+ * After clearing PageTail the gup refcount can be released.
+ * Page flags also must be visible before we make the page non-compound.
+ */
+ smp_wmb();
+
clear_compound_head(page_tail);

if (page_is_young(head))
diff --git a/mm/rmap.c b/mm/rmap.c
index 35643176bc15..e4f8d9fb1c3d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1173,20 +1173,12 @@ void do_page_add_anon_rmap(struct page *page,
bool compound = flags & RMAP_COMPOUND;
bool first;

- if (PageTransCompound(page)) {
+ if (compound) {
+ atomic_t *mapcount;
VM_BUG_ON_PAGE(!PageLocked(page), page);
- if (compound) {
- atomic_t *mapcount;
-
- VM_BUG_ON_PAGE(!PageTransHuge(page), page);
- mapcount = compound_mapcount_ptr(page);
- first = atomic_inc_and_test(mapcount);
- } else {
- /* Anon THP always mapped first with PMD */
- first = 0;
- VM_BUG_ON_PAGE(!page_mapcount(page), page);
- atomic_inc(&page->_mapcount);
- }
+ VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+ mapcount = compound_mapcount_ptr(page);
+ first = atomic_inc_and_test(mapcount);
} else {
VM_BUG_ON_PAGE(compound, page);
first = atomic_inc_and_test(&page->_mapcount);
@@ -1201,7 +1193,6 @@ void do_page_add_anon_rmap(struct page *page,
* disabled.
*/
if (compound) {
- VM_BUG_ON_PAGE(!PageTransHuge(page), page);
__inc_zone_page_state(page,
NR_ANON_TRANSPARENT_HUGEPAGES);
}
--
Kirill A. Shutemov


2015-11-03 03:03:08

by Minchan Kim

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

Hello Kirill,

On Mon, Nov 02, 2015 at 02:57:49PM +0200, Kirill A. Shutemov wrote:
> Hm. I was not able to trigger this, and I don't see anything obvious that
> could lead to this kind of mismatch :-/
>
> I found one more bug: the clearing of PageTail can become visible to other
> CPUs before the updated page->flags of that page.
>
> I don't think this bug is connected to what you've reported, but it is worth
> testing.

I'm happy to test, but I have one request.
Please send a new formal all-in-one patch instead of code snippets.
That makes it easier to test and communicate, and helps others understand
the current issues and your approach.

Also, please say which kernel your patch is based on.

Thanks.


2015-11-03 07:17:25

by Kirill A. Shutemov

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

On Tue, Nov 03, 2015 at 12:02:58PM +0900, Minchan Kim wrote:
> Hello Kirill,
>
> On Mon, Nov 02, 2015 at 02:57:49PM +0200, Kirill A. Shutemov wrote:
> > Hm. I was not able to trigger this, and I don't see anything obvious that
> > could lead to this kind of mismatch :-/

I managed to trigger this after switching back from MADV_DONTNEED to
MADV_FREE. Hm..

> > I found one more bug: the clearing of PageTail can become visible to other
> > CPUs before the updated page->flags of that page.
> >
> > I don't think this bug is connected to what you've reported, but it is worth
> > testing.
>
> I'm happy to test, but I have one request.
> Please send a new formal all-in-one patch instead of code snippets.
> That makes it easier to test and communicate, and helps others understand
> the current issues and your approach.

I'll post a patchset with the refcounting fixes today.

> Also, please say which kernel your patch is based on.

That's on top of

https://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git since-4.2

--
Kirill A. Shutemov

2015-11-03 07:33:32

by Minchan Kim

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

On Tue, Nov 03, 2015 at 09:16:50AM +0200, Kirill A. Shutemov wrote:
> On Tue, Nov 03, 2015 at 12:02:58PM +0900, Minchan Kim wrote:
> > Hello Kirill,
> >
> > On Mon, Nov 02, 2015 at 02:57:49PM +0200, Kirill A. Shutemov wrote:
> > > On Fri, Oct 30, 2015 at 04:03:50PM +0900, Minchan Kim wrote:
> > > > On Thu, Oct 29, 2015 at 11:52:06AM +0200, Kirill A. Shutemov wrote:
> > > > > On Thu, Oct 29, 2015 at 04:58:29PM +0900, Minchan Kim wrote:
> > > > > > On Thu, Oct 29, 2015 at 02:25:24AM +0200, Kirill A. Shutemov wrote:
> > > > > > > On Thu, Oct 22, 2015 at 06:00:51PM +0900, Minchan Kim wrote:
> > > > > > > > On Thu, Oct 22, 2015 at 10:21:36AM +0900, Minchan Kim wrote:
> > > > > > > > > Hello Hugh,
> > > > > > > > >
> > > > > > > > > On Wed, Oct 21, 2015 at 05:59:59PM -0700, Hugh Dickins wrote:
> > > > > > > > > > On Thu, 22 Oct 2015, Minchan Kim wrote:
> > > > > > > > > > >
> > > > > > > > > > > I added the code to check it and queued the test again, but this time
> > > > > > > > > > > I hit another oops; the symptom is related to anon_vma, too.
> > > > > > > > > > > (The kernel is based on recent mmotm + unconditional mkdirty as a bug fix.)
> > > > > > > > > > > It seems page_get_anon_vma() returned NULL because the page was not
> > > > > > > > > > > page_mapped() at that time, but the second page_mapped() check right
> > > > > > > > > > > before try_to_unmap() seems to have been true.
> > > > > > > > > > >
> > > > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > > > > page:ffffea0001cfbfc0 count:3 mapcount:1 mapping:ffff88007f1b5f51 index:0x600000aff
> > > > > > > > > > > flags: 0x4000000000048019(locked|uptodate|dirty|swapcache|swapbacked)
> > > > > > > > > > > page dumped because: VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma)
> > > > > > > > > >
> > > > > > > > > > That's interesting, that's one I added in my page migration series.
> > > > > > > > > > Let me think on it, but it could well relate to the one you got before.
> > > > > > > > >
> > > > > > > > > I will roll back to mm/madv_free-v4.3-rc5-mmotm-2015-10-15-15-20
> > > > > > > > > instead of next-20151021 to remove noise from your migration cleanup
> > > > > > > > > series and will test it again.
> > > > > > > > > If it is fixed, I will test again with your migration patchset, then.
> > > > > > > >
> > > > > > > > I tested mmotm-2015-10-15-15-20 for a long time with the test program I
> > > > > > > > attached. So Hugh's migration patchset is not in there.
> > > > > > > > At Kirill's request, I also added the debug code below to all test kernels.
> > > > > > >
> > > > > > > It took a long time (and a lot of printk()), but I think I have finally
> > > > > > > tracked it down.
> > > > > > >
> > > > > > > The patch below seems to fix the issue for me. It's not yet properly
> > > > > > > tested, but it looks like it works.
> > > > > > >
> > > > > > > The problem was my wrong assumption about how migration works: I thought
> > > > > > > the kernel would wait for migration to finish before tearing down the
> > > > > > > mapping.
> > > > > > >
> > > > > > > But it turns out that's not true.
> > > > > > >
> > > > > > > As a result, if zap_pte_range() races with split_huge_page(), we can end up
> > > > > > > with a page which is not mapped anymore but has _count and _mapcount
> > > > > > > elevated. The page is on the LRU too, so it's still reachable by vmscan and
> > > > > > > by pfn scanners (Sasha showed a few similar traces from compaction too).
> > > > > > > It's likely that page->mapping in this case points to a freed anon_vma.
> > > > > > >
> > > > > > > BOOM!
> > > > > > >
> > > > > > > The patch modifies the freeze/unfreeze_page() code to match the logic of
> > > > > > > normal migration entries: on setup we remove the page from rmap and drop
> > > > > > > the pin; on removal we take the pin back and put the page back on rmap.
> > > > > > > This way, even if the migration entry is removed under us, we don't
> > > > > > > corrupt the page's state.
> > > > > > >
> > > > > > > Please test.
> > > > > > >
> > > > > >
> > > > > > Kernel: mmotm-2015-10-15-15-20 + pte_mkdirty patch + your new patch. I ran the
> > > > > > test I sent to you (i.e., oops.c + memcg_test.sh).
> > > > > >
> > > > > > page:ffffea00016a0000 count:3 mapcount:0 mapping:ffff88007f49d001 index:0x600001800 compound_mapcount: 0
> > > > > > flags: 0x4000000000044009(locked|uptodate|head|swapbacked)
> > > > > > page dumped because: VM_BUG_ON_PAGE(!page_mapcount(page))
> > > > > > page->mem_cgroup:ffff88007f613c00
> > > > >
> > > > > Ignore my previous answer. Still sleeping.
> > > > >
> > > > > I think the right way to fix it is something like:
> > > > >
> > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > > index 35643176bc15..f2d46792a554 100644
> > > > > --- a/mm/rmap.c
> > > > > +++ b/mm/rmap.c
> > > > > @@ -1173,20 +1173,12 @@ void do_page_add_anon_rmap(struct page *page,
> > > > >  	bool compound = flags & RMAP_COMPOUND;
> > > > >  	bool first;
> > > > >
> > > > > -	if (PageTransCompound(page)) {
> > > > > +	if (PageTransCompound(page) && compound) {
> > > > > +		atomic_t *mapcount;
> > > > >  		VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > > > -		if (compound) {
> > > > > -			atomic_t *mapcount;
> > > > > -
> > > > > -			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> > > > > -			mapcount = compound_mapcount_ptr(page);
> > > > > -			first = atomic_inc_and_test(mapcount);
> > > > > -		} else {
> > > > > -			/* Anon THP always mapped first with PMD */
> > > > > -			first = 0;
> > > > > -			VM_BUG_ON_PAGE(!page_mapcount(page), page);
> > > > > -			atomic_inc(&page->_mapcount);
> > > > > -		}
> > > > > +		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> > > > > +		mapcount = compound_mapcount_ptr(page);
> > > > > +		first = atomic_inc_and_test(mapcount);
> > > > >  	} else {
> > > > >  		VM_BUG_ON_PAGE(compound, page);
> > > > >  		first = atomic_inc_and_test(&page->_mapcount);
> > > > > --
> > > >
> > > > kernel: On mmotm-2015-10-15-15-20 + pte_mkdirty patch + freeze/unfreeze patch + above patch,
> > > >
> > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512
> > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > >
> > > > <SNIP>
> > > >
> > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > BUG: Bad rss-counter state mm:ffff880046980700 idx:1 val:511
> > > > BUG: Bad rss-counter state mm:ffff880046980700 idx:2 val:1
> > >
> > > Hm. I was not able to trigger this and don't see anything obvious that could
> > > lead to this kind of mismatch :-/
>
> I managed to trigger this when I switched back from MADV_DONTNEED to
> MADV_FREE. Hm..

Hmm.
Which version of MADV_FREE did you test on?
The old MADV_FREE (i.e., before I posted the MADV_FREE refactoring and the KSM page fix)
had a bug.

I tried your patches on top of my recent MADV_FREE patches.
But when I tried them with the old THP refcount redesign, I couldn't find
any problem so far. However, I'm not saying it's your fault.

I will give it a shot with MADV_DONTNEED to reproduce the problem.
But one thing I can say is that the problem is harder to hit with MADV_DONTNEED
than with MADV_FREE, because the memory pressure in the MADV_DONTNEED test
wouldn't be heavy.

>
> > > I found one more bug: the clearing of PageTail can become visible to other CPUs
> > > before the updated page->flags on the page.
> > >
> > > I don't think this bug is connected to what you've reported, but it's worth
> > > testing.
> >
> > I'm happy to test, but I ask one thing.
> > I hope you will send a new, formal all-in-one patch instead of code snippets.
> > It would make testing and communication easier and help others understand the
> > current issues and your approach.
>
> I'll post a patchset with refcounting fixes today.

Yep, I will wait, and if I get it before leaving the office,
I will queue it on the test machine.

>
> > And please say which kernel your patch is based on.
>
> That's on top of
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git since-4.2

I have been testing on git://git.cmpxchg.org/linux-mmotm.git.
I guess applying your patch to Hannes's tree is not difficult.
I will continue to use Hannes's mmotm.

>
> --
> Kirill A. Shutemov

2015-11-03 15:52:44

by Minchan Kim

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

On Tue, Nov 03, 2015 at 04:33:29PM +0900, Minchan Kim wrote:
> Hmm.
> Which version of MADV_FREE did you test on?
> The old MADV_FREE (i.e., before I posted the MADV_FREE refactoring and the KSM page fix)
> had a bug.
>
> I tried your patches on top of my recent MADV_FREE patches.
> But when I tried them with the old THP refcount redesign, I couldn't find
> any problem so far. However, I'm not saying it's your fault.
>
> I will give it a shot with MADV_DONTNEED to reproduce the problem.
> But one thing I can say is that the problem is harder to hit with MADV_DONTNEED
> than with MADV_FREE, because the memory pressure in the MADV_DONTNEED test
> wouldn't be heavy.

I reproduced this on a kernel which has no code related to MADV_FREE:

mmotm-2015-10-15-15-20-no-madvise_free, IOW the git head is
54bad5da4834 ("arm64: add pmd_[dirty|mkclean] for THP"), so there is no
MADV_FREE code in there
+ pte_mkdirty patch
+ freeze/unfreeze patch
+ do_page_add_anon_rmap patch

Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
BUG: Bad rss-counter state mm:ffff88007fdd5b00 idx:1 val:511
BUG: Bad rss-counter state mm:ffff88007fdd5b00 idx:2 val:1
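A note on reading the report above: each "Bad rss-counter state" line names one per-mm counter by index. The sketch below assumes the 4.3-era counter order (idx 0 = MM_FILEPAGES, 1 = MM_ANONPAGES, 2 = MM_SWAPENTS) and 512 base pages per huge page, as on x86-64 with 4K pages. Under those assumptions the leftover 511 + 1 adds up to exactly one huge page, the same total as the earlier val:512 report. This is only a reading of the numbers, not a confirmed diagnosis.

```c
#include <assert.h>

/* Assumed 4.3-era order of the per-mm RSS counters. */
enum { MM_FILEPAGES, MM_ANONPAGES, MM_SWAPENTS, NR_MM_COUNTERS };

/* Assumption: x86-64, 4K base pages, 2M huge pages. */
#define HPAGE_PMD_NR 512

/* Map the idx printed in the report to a counter name. */
static const char *rss_counter_name(int idx)
{
	static const char *const names[NR_MM_COUNTERS] = {
		"MM_FILEPAGES", "MM_ANONPAGES", "MM_SWAPENTS",
	};

	return (idx >= 0 && idx < NR_MM_COUNTERS) ? names[idx] : "unknown";
}

/* Sum the leftover values reported for a single mm. */
static long leaked_pages(const long val[NR_MM_COUNTERS])
{
	long sum = 0;

	assert(val);
	for (int i = 0; i < NR_MM_COUNTERS; i++)
		sum += val[i];
	return sum;
}
```

For the report above, the array would be { 0, 511, 1 }, and leaked_pages() returns HPAGE_PMD_NR.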

2015-11-04 14:21:39

by Kirill A. Shutemov

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

On Wed, Nov 04, 2015 at 12:20:19AM +0900, Minchan Kim wrote:
> I reproduced this on a kernel which has no code related to MADV_FREE:
>
> mmotm-2015-10-15-15-20-no-madvise_free, IOW the git head is
> 54bad5da4834 ("arm64: add pmd_[dirty|mkclean] for THP"), so there is no
> MADV_FREE code in there
> + pte_mkdirty patch
> + freeze/unfreeze patch
> + do_page_add_anon_rmap patch
>
> Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> BUG: Bad rss-counter state mm:ffff88007fdd5b00 idx:1 val:511
> BUG: Bad rss-counter state mm:ffff88007fdd5b00 idx:2 val:1

I have one idea why it could happen, but I'm not sure yet.

Could you check if this makes any difference for you?

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5c7b00e88236..194f7f8b8c66 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -103,12 +103,7 @@ void deferred_split_huge_page(struct page *page);
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address);

-#define split_huge_pmd(__vma, __pmd, __address) \
-	do { \
-		pmd_t *____pmd = (__pmd); \
-		if (pmd_trans_huge(*____pmd)) \
-			__split_huge_pmd(__vma, __pmd, __address); \
-	} while (0)
+#define split_huge_pmd(__vma, __pmd, __address) __split_huge_pmd(__vma, __pmd, __address)
--
Kirill A. Shutemov
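
In isolation, what this test patch changes is when __split_huge_pmd() gets called: the old macro skips the call whenever its pmd_trans_huge() pre-check fails, while the new one always makes the call. A small user-space model makes that difference observable. All fake_* names are hypothetical stubs, not kernel code, and the model says nothing about what __split_huge_pmd() does once entered.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for pmd_t: only tracks "looks like a huge PMD". */
struct fake_pmd {
	bool trans_huge;
};

/* Counts how often the stubbed split function was entered. */
static int split_calls;

static bool fake_pmd_trans_huge(struct fake_pmd pmd)
{
	return pmd.trans_huge;
}

static void fake_split_huge_pmd(struct fake_pmd *pmd)
{
	assert(pmd);
	split_calls++;
}

/* Old shape: look at the entry first, call only if it looks huge. */
#define split_huge_pmd_old(__pmd)				\
	do {							\
		struct fake_pmd *____pmd = (__pmd);		\
		if (fake_pmd_trans_huge(*____pmd))		\
			fake_split_huge_pmd(____pmd);		\
	} while (0)

/* New shape from the test patch: always call. */
#define split_huge_pmd_new(__pmd) fake_split_huge_pmd(__pmd)
```

With an entry whose trans_huge flag is false, split_huge_pmd_old() leaves split_calls untouched and split_huge_pmd_new() increments it, so any entry state that fails the pre-check reaches the callee only with the new form.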

2015-11-05 00:19:16

by Minchan Kim

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

On Wed, Nov 04, 2015 at 04:21:35PM +0200, Kirill A. Shutemov wrote:
> On Wed, Nov 04, 2015 at 12:20:19AM +0900, Minchan Kim wrote:
> > On Tue, Nov 03, 2015 at 04:33:29PM +0900, Minchan Kim wrote:
> > > On Tue, Nov 03, 2015 at 09:16:50AM +0200, Kirill A. Shutemov wrote:
> > > > On Tue, Nov 03, 2015 at 12:02:58PM +0900, Minchan Kim wrote:
> > > > > Hello Kirill,
> > > > >
> > > > > On Mon, Nov 02, 2015 at 02:57:49PM +0200, Kirill A. Shutemov wrote:
> > > > > > On Fri, Oct 30, 2015 at 04:03:50PM +0900, Minchan Kim wrote:
> > > > > > > On Thu, Oct 29, 2015 at 11:52:06AM +0200, Kirill A. Shutemov wrote:
> > > > > > > > On Thu, Oct 29, 2015 at 04:58:29PM +0900, Minchan Kim wrote:
> > > > > > > > > On Thu, Oct 29, 2015 at 02:25:24AM +0200, Kirill A. Shutemov wrote:
> > > > > > > > > > On Thu, Oct 22, 2015 at 06:00:51PM +0900, Minchan Kim wrote:
> > > > > > > > > > > On Thu, Oct 22, 2015 at 10:21:36AM +0900, Minchan Kim wrote:
> > > > > > > > > > > > Hello Hugh,
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Oct 21, 2015 at 05:59:59PM -0700, Hugh Dickins wrote:
> > > > > > > > > > > > > On Thu, 22 Oct 2015, Minchan Kim wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I added the code to check it and queued it again but I had another oops
> > > > > > > > > > > > > > in this time but symptom is related to anon_vma, too.
> > > > > > > > > > > > > > (kernel is based on recent mmotm + unconditional mkdirty for bug fix)
> > > > > > > > > > > > > > It seems page_get_anon_vma returns NULL since the page was not page_mapped
> > > > > > > > > > > > > > at that time but second check of page_mapped right before try_to_unmap seems
> > > > > > > > > > > > > > to be true.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > > > > > > > page:ffffea0001cfbfc0 count:3 mapcount:1 mapping:ffff88007f1b5f51 index:0x600000aff
> > > > > > > > > > > > > > flags: 0x4000000000048019(locked|uptodate|dirty|swapcache|swapbacked)
> > > > > > > > > > > > > > page dumped because: VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma)
> > > > > > > > > > > > >
> > > > > > > > > > > > > That's interesting, that's one I added in my page migration series.
> > > > > > > > > > > > > Let me think on it, but it could well relate to the one you got before.
> > > > > > > > > > > >
> > > > > > > > > > > > I will roll back to mm/madv_free-v4.3-rc5-mmotm-2015-10-15-15-20
> > > > > > > > > > > > instead of next-20151021 to remove noise from your migration cleanup
> > > > > > > > > > > > series and will test it again.
> > > > > > > > > > > > If it is fixed, I will test again with your migration patchset, then.
> > > > > > > > > > >
> > > > > > > > > > > I tested mmotm-2015-10-15-15-20 with test program I attach for a long time.
> > > > > > > > > > > Therefore, there is no patchset from Hugh's migration patch in there.
> > > > > > > > > > > And I added below debug code with request from Kirill to all test kernels.
> > > > > > > > > >
> > > > > > > > > > It took too long time (and a lot of printk()), but I think I track it down
> > > > > > > > > > finally.
> > > > > > > > > >
> > > > > > > > > > The patch below seems fixes issue for me. It's not yet properly tested, but
> > > > > > > > > > looks like it works.
> > > > > > > > > >
> > > > > > > > > > The problem was my wrong assumption on how migration works: I thought that
> > > > > > > > > > kernel would wait migration to finish on before deconstruction mapping.
> > > > > > > > > >
> > > > > > > > > > But turn out that's not true.
> > > > > > > > > >
> > > > > > > > > > As result if zap_pte_range() races with split_huge_page(), we can end up
> > > > > > > > > > with page which is not mapped anymore but has _count and _mapcount
> > > > > > > > > > elevated. The page is on LRU too. So it's still reachable by vmscan and by
> > > > > > > > > > pfn scanners (Sasha showed few similar traces from compaction too).
> > > > > > > > > > It's likely that page->mapping in this case would point to freed anon_vma.
> > > > > > > > > >
> > > > > > > > > > BOOM!
> > > > > > > > > >
> > > > > > > > > > The patch modify freeze/unfreeze_page() code to match normal migration
> > > > > > > > > > entries logic: on setup we remove page from rmap and drop pin, on removing
> > > > > > > > > > we get pin back and put page on rmap. This way even if migration entry
> > > > > > > > > > will be removed under us we don't corrupt page's state.
> > > > > > > > > >
> > > > > > > > > > Please, test.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > kernel: On mmotm-2015-10-15-15-20 + pte_mkdirty patch + your new patch, I tested
> > > > > > > > > one I sent to you(ie, oops.c + memcg_test.sh)
> > > > > > > > >
> > > > > > > > > page:ffffea00016a0000 count:3 mapcount:0 mapping:ffff88007f49d001 index:0x600001800 compound_mapcount: 0
> > > > > > > > > flags: 0x4000000000044009(locked|uptodate|head|swapbacked)
> > > > > > > > > page dumped because: VM_BUG_ON_PAGE(!page_mapcount(page))
> > > > > > > > > page->mem_cgroup:ffff88007f613c00
> > > > > > > >
> > > > > > > > Ignore my previous answer. Still sleeping.
> > > > > > > >
> > > > > > > > The right way to fix I think is something like:
> > > > > > > >
> > > > > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > > > > > index 35643176bc15..f2d46792a554 100644
> > > > > > > > --- a/mm/rmap.c
> > > > > > > > +++ b/mm/rmap.c
> > > > > > > > @@ -1173,20 +1173,12 @@ void do_page_add_anon_rmap(struct page *page,
> > > > > > > > bool compound = flags & RMAP_COMPOUND;
> > > > > > > > bool first;
> > > > > > > >
> > > > > > > > - if (PageTransCompound(page)) {
> > > > > > > > + if (PageTransCompound(page) && compound) {
> > > > > > > > + atomic_t *mapcount;
> > > > > > > > VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > > > > > > - if (compound) {
> > > > > > > > - atomic_t *mapcount;
> > > > > > > > -
> > > > > > > > - VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> > > > > > > > - mapcount = compound_mapcount_ptr(page);
> > > > > > > > - first = atomic_inc_and_test(mapcount);
> > > > > > > > - } else {
> > > > > > > > - /* Anon THP always mapped first with PMD */
> > > > > > > > - first = 0;
> > > > > > > > - VM_BUG_ON_PAGE(!page_mapcount(page), page);
> > > > > > > > - atomic_inc(&page->_mapcount);
> > > > > > > > - }
> > > > > > > > + VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> > > > > > > > + mapcount = compound_mapcount_ptr(page);
> > > > > > > > + first = atomic_inc_and_test(mapcount);
> > > > > > > > } else {
> > > > > > > > VM_BUG_ON_PAGE(compound, page);
> > > > > > > > first = atomic_inc_and_test(&page->_mapcount);
> > > > > > > > --
> > > > > > >
> > > > > > > kernel: On mmotm-2015-10-15-15-20 + pte_mkdirty patch + freeze/unfreeze patch + above patch,
> > > > > > >
> > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512
> > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > >
> > > > > > > <SNIP>
> > > > > > >
> > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > BUG: Bad rss-counter state mm:ffff880046980700 idx:1 val:511
> > > > > > > BUG: Bad rss-counter state mm:ffff880046980700 idx:2 val:1
> > > > > >
> > > > > > Hm. I was not able to trigger this and don't see anything obvious that could
> > > > > > lead to this kind of mismatch :-/
> > > >
> > > > I managed to trigger this when switched back from MADV_DONTNEED to
> > > > MADV_FREE. Hm..
> > >
> > > Hmm,,
> > > What version of MADV_FREE do you test on?
> > > Old MADV_FREE(ie, before posting MADV_FREE refactoring and fix KSM page)
> > > had a bug.
> > >
> > > I tried your patches on top of recent my MADV_FREE patches.
> > > But when I try it with old THP refcount redesign, I couldn't find
> > > any problem so far. However, I'm not saying it's your fault.
> > >
> > > I will give it a shot with MADV_DONTNEED to reproduce the problem.
> > > But one thing I could say is MADV_DONTNEED is more hard to hit
> > > compared to MADV_FREE because memory pressure of MADV_DONTNEED test
> > > wouldn't be heavy.
> >
> > I reproduced this on the kernel which has no code related to MADV_FREE:
> >
> > mmotm-2015-10-15-15-20-no-madvise_free, IOW it means git head for
> > 54bad5da4834 arm64: add pmd_[dirty|mkclean] for THP so there is no
> > MADV_FREE code in there
> > + pte_mkdirty patch
> > + freeze/unfreeze patch
> > + do_page_add_anon_rmap patch
> >
> > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > BUG: Bad rss-counter state mm:ffff88007fdd5b00 idx:1 val:511
> > BUG: Bad rss-counter state mm:ffff88007fdd5b00 idx:2 val:1
>
> I have one idea why it could happen, but not sure yet..
>
> Could you check if it makes any difference for you?
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 5c7b00e88236..194f7f8b8c66 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -103,12 +103,7 @@ void deferred_split_huge_page(struct page *page);
> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long address);
>
> -#define split_huge_pmd(__vma, __pmd, __address) \
> - do { \
> - pmd_t *____pmd = (__pmd); \
> - if (pmd_trans_huge(*____pmd)) \
> - __split_huge_pmd(__vma, __pmd, __address); \
> - } while (0)
> +#define split_huge_pmd(__vma, __pmd, __address) __split_huge_pmd(__vma, __pmd, __address)

Tested kernel: mmotm-2015-10-15-15-20-no-madvise_free, i.e. the git head is
54bad5da4834 ("arm64: add pmd_[dirty|mkclean] for THP"), so there is no
MADV_FREE code in there, plus:
+ pte_mkdirty patch
+ freeze/unfreeze patch
+ do_page_add_anon_rmap patch
+ the split_huge_pmd patch above


Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
BUG: Bad rss-counter state mm:ffff88007fa3bb80 idx:1 val:512
Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff810782a9>] down_read_trylock+0x9/0x30
PGD 0
Oops: 0000 [#1] SMP
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 11 PID: 59 Comm: khugepaged Not tainted 4.3.0-rc5-mm1-no-madv-free+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff8800b9851a40 ti: ffff8800b985c000 task.ti: ffff8800b985c000
RIP: 0010:[<ffffffff810782a9>] [<ffffffff810782a9>] down_read_trylock+0x9/0x30
RSP: 0018:ffff8800b985f778 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffea0000154b80 RCX: ffff8800b985f918
RDX: 0000000000000000 RSI: ffff8800b985f818 RDI: 0000000000000008
RBP: ffff8800b985f778 R08: ffffffff818446a0 R09: ffff8800b903cff8
R10: ffff8800b903d168 R11: ffff8800b985f7b8 R12: ffff88007ef6c731
R13: ffff88007ef6c730 R14: 0000000000000008 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff8800bfb60000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000001808000 CR4: 00000000000006a0
Stack:
ffff8800b985f7a8 ffffffff81124f20 ffffea0000154b80 ffff8800b985f818
ffff88007ff4dc00 0000000000000000 ffff8800b985f7f0 ffffffff81125663
0000000000000000 ffffffff818446a0 ffffea0000154b80 ffff8800b985f918
Call Trace:
[<ffffffff81124f20>] page_lock_anon_vma_read+0x60/0x180
[<ffffffff81125663>] rmap_walk+0x1b3/0x3f0
[<ffffffff81125a43>] page_referenced+0x1a3/0x220
[<ffffffff81123e20>] ? __page_check_address+0x1a0/0x1a0
[<ffffffff81124ec0>] ? page_get_anon_vma+0xd0/0xd0
[<ffffffff81123810>] ? anon_vma_ctor+0x40/0x40
[<ffffffff8110086b>] shrink_page_list+0x5ab/0xde0
[<ffffffff8110173c>] shrink_inactive_list+0x18c/0x4b0
[<ffffffff811023ad>] shrink_lruvec+0x59d/0x740
[<ffffffff811025e0>] shrink_zone+0x90/0x250
[<ffffffff811028cd>] do_try_to_free_pages+0x12d/0x3b0
[<ffffffff81102d2d>] try_to_free_mem_cgroup_pages+0x9d/0x120
[<ffffffff81149949>] try_charge+0x1f9/0x670
[<ffffffff810fb030>] ? lru_cache_add_file+0x40/0x40
[<ffffffff8114d0a6>] mem_cgroup_try_charge+0x86/0x120
[<ffffffff811433bc>] khugepaged+0x7cc/0x1ac0
[<ffffffff81064f01>] ? __clear_sched_clock_stable+0x11/0x20
[<ffffffff81072430>] ? prepare_to_wait_event+0xf0/0xf0
[<ffffffff81142bf0>] ? __split_huge_pmd_locked+0x4a0/0x4a0
[<ffffffff81056cd9>] kthread+0xc9/0xe0
[<ffffffff81056c10>] ? kthread_park+0x60/0x60
[<ffffffff8142066f>] ret_from_fork+0x3f/0x70
[<ffffffff81056c10>] ? kthread_park+0x60/0x60
Code: 6e 7b 3a 00 48 83 c4 08 5b 5d c3 48 89 45 f0 e8 ab 63 3a 00 48 8b 45 f0 eb df 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7
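
One plausible reading of the register dump above (an inference, not something stated in the thread): R12 holds page->mapping with its low "anon" tag bit set, R13 is the same value with the tag stripped (the anon_vma), and anon_vma->root was NULL, so taking the address of root->rwsem yields 8, which matches RDI and CR2. The userspace sketch below only reproduces that arithmetic. The mock_* types and names are hypothetical stand-ins, and the single layout assumption is that root is the first field with the rwsem directly after it.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins, not the kernel definitions. The only assumption
 * is the field order: root first, the rwsem directly after it.
 */
struct mock_rwsem {
	long count;
	void *wait_list[2];
};

struct mock_anon_vma {
	struct mock_anon_vma *root;	/* offset 0 */
	struct mock_rwsem rwsem;	/* offset sizeof(void *) */
};

/* Assumed low tag bit marking an anonymous mapping in page->mapping. */
#define MOCK_PAGE_MAPPING_ANON UINT64_C(1)

/* Strip the tag bit to recover the anon_vma address. */
static uint64_t mock_anon_vma_addr(uint64_t mapping)
{
	return mapping & ~MOCK_PAGE_MAPPING_ANON;
}

/* Address of root->rwsem for a given root pointer value. */
static uint64_t mock_root_rwsem_addr(uint64_t root)
{
	return root + offsetof(struct mock_anon_vma, rwsem);
}
```

With the values from the trace, mock_anon_vma_addr(0xffff88007ef6c731) gives 0xffff88007ef6c730 (the R13 value), and on a 64-bit build mock_root_rwsem_addr(0) gives 8.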


2015-11-08 22:55:30

by Kirill A. Shutemov

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

On Thu, Nov 05, 2015 at 09:19:22AM +0900, Minchan Kim wrote:
> On Wed, Nov 04, 2015 at 04:21:35PM +0200, Kirill A. Shutemov wrote:
> > On Wed, Nov 04, 2015 at 12:20:19AM +0900, Minchan Kim wrote:
> > > On Tue, Nov 03, 2015 at 04:33:29PM +0900, Minchan Kim wrote:
> > > > On Tue, Nov 03, 2015 at 09:16:50AM +0200, Kirill A. Shutemov wrote:
> > > > > On Tue, Nov 03, 2015 at 12:02:58PM +0900, Minchan Kim wrote:
> > > > > > Hello Kirill,
> > > > > >
> > > > > > On Mon, Nov 02, 2015 at 02:57:49PM +0200, Kirill A. Shutemov wrote:
> > > > > > > On Fri, Oct 30, 2015 at 04:03:50PM +0900, Minchan Kim wrote:
> > > > > > > > On Thu, Oct 29, 2015 at 11:52:06AM +0200, Kirill A. Shutemov wrote:
> > > > > > > > > On Thu, Oct 29, 2015 at 04:58:29PM +0900, Minchan Kim wrote:
> > > > > > > > > > On Thu, Oct 29, 2015 at 02:25:24AM +0200, Kirill A. Shutemov wrote:
> > > > > > > > > > > On Thu, Oct 22, 2015 at 06:00:51PM +0900, Minchan Kim wrote:
> > > > > > > > > > > > On Thu, Oct 22, 2015 at 10:21:36AM +0900, Minchan Kim wrote:
> > > > > > > > > > > > > Hello Hugh,
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Oct 21, 2015 at 05:59:59PM -0700, Hugh Dickins wrote:
> > > > > > > > > > > > > > On Thu, 22 Oct 2015, Minchan Kim wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I added the code to check it and queued it again but I had another oops
> > > > > > > > > > > > > > > in this time but symptom is related to anon_vma, too.
> > > > > > > > > > > > > > > (kernel is based on recent mmotm + unconditional mkdirty for bug fix)
> > > > > > > > > > > > > > > It seems page_get_anon_vma returns NULL since the page was not page_mapped
> > > > > > > > > > > > > > > at that time but second check of page_mapped right before try_to_unmap seems
> > > > > > > > > > > > > > > to be true.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > > > > > > > > page:ffffea0001cfbfc0 count:3 mapcount:1 mapping:ffff88007f1b5f51 index:0x600000aff
> > > > > > > > > > > > > > > flags: 0x4000000000048019(locked|uptodate|dirty|swapcache|swapbacked)
> > > > > > > > > > > > > > > page dumped because: VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > That's interesting, that's one I added in my page migration series.
> > > > > > > > > > > > > > Let me think on it, but it could well relate to the one you got before.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I will roll back to mm/madv_free-v4.3-rc5-mmotm-2015-10-15-15-20
> > > > > > > > > > > > > instead of next-20151021 to remove noise from your migration cleanup
> > > > > > > > > > > > > series and will test it again.
> > > > > > > > > > > > > If it is fixed, I will test again with your migration patchset, then.
> > > > > > > > > > > >
> > > > > > > > > > > > I tested mmotm-2015-10-15-15-20 with test program I attach for a long time.
> > > > > > > > > > > > Therefore, there is no patchset from Hugh's migration patch in there.
> > > > > > > > > > > > And I added below debug code with request from Kirill to all test kernels.
> > > > > > > > > > >
> > > > > > > > > > > It took too long time (and a lot of printk()), but I think I track it down
> > > > > > > > > > > finally.
> > > > > > > > > > >
> > > > > > > > > > > The patch below seems fixes issue for me. It's not yet properly tested, but
> > > > > > > > > > > looks like it works.
> > > > > > > > > > >
> > > > > > > > > > > The problem was my wrong assumption on how migration works: I thought that
> > > > > > > > > > > kernel would wait migration to finish on before deconstruction mapping.
> > > > > > > > > > >
> > > > > > > > > > > But turn out that's not true.
> > > > > > > > > > >
> > > > > > > > > > > As result if zap_pte_range() races with split_huge_page(), we can end up
> > > > > > > > > > > with page which is not mapped anymore but has _count and _mapcount
> > > > > > > > > > > elevated. The page is on LRU too. So it's still reachable by vmscan and by
> > > > > > > > > > > pfn scanners (Sasha showed few similar traces from compaction too).
> > > > > > > > > > > It's likely that page->mapping in this case would point to freed anon_vma.
> > > > > > > > > > >
> > > > > > > > > > > BOOM!
> > > > > > > > > > >
> > > > > > > > > > > The patch modify freeze/unfreeze_page() code to match normal migration
> > > > > > > > > > > entries logic: on setup we remove page from rmap and drop pin, on removing
> > > > > > > > > > > we get pin back and put page on rmap. This way even if migration entry
> > > > > > > > > > > will be removed under us we don't corrupt page's state.
> > > > > > > > > > >
> > > > > > > > > > > Please, test.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > kernel: On mmotm-2015-10-15-15-20 + pte_mkdirty patch + your new patch, I tested
> > > > > > > > > > one I sent to you(ie, oops.c + memcg_test.sh)
> > > > > > > > > >
> > > > > > > > > > page:ffffea00016a0000 count:3 mapcount:0 mapping:ffff88007f49d001 index:0x600001800 compound_mapcount: 0
> > > > > > > > > > flags: 0x4000000000044009(locked|uptodate|head|swapbacked)
> > > > > > > > > > page dumped because: VM_BUG_ON_PAGE(!page_mapcount(page))
> > > > > > > > > > page->mem_cgroup:ffff88007f613c00
> > > > > > > > >
> > > > > > > > > Ignore my previous answer. Still sleeping.
> > > > > > > > >
> > > > > > > > > The right way to fix I think is something like:
> > > > > > > > >
> > > > > > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > > > > > > index 35643176bc15..f2d46792a554 100644
> > > > > > > > > --- a/mm/rmap.c
> > > > > > > > > +++ b/mm/rmap.c
> > > > > > > > > @@ -1173,20 +1173,12 @@ void do_page_add_anon_rmap(struct page *page,
> > > > > > > > > bool compound = flags & RMAP_COMPOUND;
> > > > > > > > > bool first;
> > > > > > > > >
> > > > > > > > > - if (PageTransCompound(page)) {
> > > > > > > > > + if (PageTransCompound(page) && compound) {
> > > > > > > > > + atomic_t *mapcount;
> > > > > > > > > VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > > > > > > > - if (compound) {
> > > > > > > > > - atomic_t *mapcount;
> > > > > > > > > -
> > > > > > > > > - VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> > > > > > > > > - mapcount = compound_mapcount_ptr(page);
> > > > > > > > > - first = atomic_inc_and_test(mapcount);
> > > > > > > > > - } else {
> > > > > > > > > - /* Anon THP always mapped first with PMD */
> > > > > > > > > - first = 0;
> > > > > > > > > - VM_BUG_ON_PAGE(!page_mapcount(page), page);
> > > > > > > > > - atomic_inc(&page->_mapcount);
> > > > > > > > > - }
> > > > > > > > > + VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> > > > > > > > > + mapcount = compound_mapcount_ptr(page);
> > > > > > > > > + first = atomic_inc_and_test(mapcount);
> > > > > > > > > } else {
> > > > > > > > > VM_BUG_ON_PAGE(compound, page);
> > > > > > > > > first = atomic_inc_and_test(&page->_mapcount);
> > > > > > > > > --
> > > > > > > >
> > > > > > > > kernel: On mmotm-2015-10-15-15-20 + pte_mkdirty patch + freeze/unfreeze patch + above patch,
> > > > > > > >
> > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512
> > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > >
> > > > > > > > <SNIP>
> > > > > > > >
> > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > BUG: Bad rss-counter state mm:ffff880046980700 idx:1 val:511
> > > > > > > > BUG: Bad rss-counter state mm:ffff880046980700 idx:2 val:1
> > > > > > >
> > > > > > > Hm. I was not able to trigger this and don't see anything obvious that could
> > > > > > > lead to this kind of mismatch :-/
> > > > >
> > > > > I managed to trigger this when switched back from MADV_DONTNEED to
> > > > > MADV_FREE. Hm..
> > > >
> > > > Hmm,,
> > > > What version of MADV_FREE do you test on?
> > > > Old MADV_FREE(ie, before posting MADV_FREE refactoring and fix KSM page)
> > > > had a bug.
> > > >
> > > > I tried your patches on top of recent my MADV_FREE patches.
> > > > But when I try it with old THP refcount redesign, I couldn't find
> > > > any problem so far. However, I'm not saying it's your fault.
> > > >
> > > > I will give it a shot with MADV_DONTNEED to reproduce the problem.
> > > > But one thing I could say is MADV_DONTNEED is more hard to hit
> > > > compared to MADV_FREE because memory pressure of MADV_DONTNEED test
> > > > wouldn't be heavy.
> > >
> > > I reproduced this on the kernel which has no code related to MADV_FREE:
> > >
> > > mmotm-2015-10-15-15-20-no-madvise_free, IOW it means git head for
> > > 54bad5da4834 arm64: add pmd_[dirty|mkclean] for THP so there is no
> > > MADV_FREE code in there
> > > + pte_mkdirty patch
> > > + freeze/unfreeze patch
> > > + do_page_add_anon_rmap patch
> > >
> > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > BUG: Bad rss-counter state mm:ffff88007fdd5b00 idx:1 val:511
> > > BUG: Bad rss-counter state mm:ffff88007fdd5b00 idx:2 val:1
> >
> > I have one idea why it could happen, but not sure yet..
> >
> > Could you check if it makes any difference for you?
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 5c7b00e88236..194f7f8b8c66 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -103,12 +103,7 @@ void deferred_split_huge_page(struct page *page);
> > void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > unsigned long address);
> >
> > -#define split_huge_pmd(__vma, __pmd, __address) \
> > - do { \
> > - pmd_t *____pmd = (__pmd); \
> > - if (pmd_trans_huge(*____pmd)) \
> > - __split_huge_pmd(__vma, __pmd, __address); \
> > - } while (0)
> > +#define split_huge_pmd(__vma, __pmd, __address) __split_huge_pmd(__vma, __pmd, __address)
>
> mmotm-2015-10-15-15-20-no-madvise_free, IOW it means git head for
> 54bad5da4834 arm64: add pmd_[dirty|mkclean] for THP so there is no
> MADV_FREE code in there
> + pte_mkdirty patch
> + freeze/unfreeze patch
> + do_page_add_anon_rmap patch
> + above split_huge_pmd
>
>
> Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> BUG: Bad rss-counter state mm:ffff88007fa3bb80 idx:1 val:512
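
The leftover counter values in these reports are suggestive. Assuming idx:1 is the anonymous-page counter and idx:2 the swap-entry counter in this kernel (an assumption, the thread does not say), and assuming 4 KiB base pages with 2 MiB huge pages, both "val:512" and "val:511" plus "val:1" add up to exactly one huge page's worth of base pages. That would be consistent with a single THP still being accounted when the mm is torn down. A minimal userspace sketch of the arithmetic, with hypothetical toy_* names:

```c
#include <stdbool.h>

/* Assumed x86-64 sizes: 4 KiB base pages, 2 MiB PMD-mapped huge pages. */
#define TOY_PAGE_SIZE      (4UL * 1024)
#define TOY_HPAGE_PMD_SIZE (2UL * 1024 * 1024)

/* Number of base pages covered by one huge page. */
static unsigned long toy_hpage_pmd_nr(void)
{
	return TOY_HPAGE_PMD_SIZE / TOY_PAGE_SIZE;
}

/* True when the leftover counters add up to a whole number of huge pages. */
static bool toy_leak_is_whole_thp(unsigned long anon_left,
				  unsigned long swap_left)
{
	unsigned long total = anon_left + swap_left;

	return total != 0 && total % toy_hpage_pmd_nr() == 0;
}
```

Both reported patterns, (512, 0) and (511, 1), satisfy toy_leak_is_whole_thp() under these assumptions.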

With the patch below my test setup ran for 2+ days without triggering the
bug. The split_huge_pmd patch should be dropped.

Please test.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 14cbbad54a3e..7aa0a3fef2aa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2841,9 +2841,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	write = pmd_write(*pmd);
 	young = pmd_young(*pmd);

-	/* leave pmd empty until pte is filled */
-	pmdp_huge_clear_flush_notify(vma, haddr, pmd);
-
 	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 	pmd_populate(mm, &_pmd, pgtable);

@@ -2893,6 +2890,28 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	}

 	smp_wmb(); /* make pte visible before pmd */
+	/*
+	 * Up to this point the pmd is present and huge and userland has the
+	 * whole access to the hugepage during the split (which happens in
+	 * place). If we overwrite the pmd with the not-huge version pointing
+	 * to the pte here (which of course we could if all CPUs were bug
+	 * free), userland could trigger a small page size TLB miss on the
+	 * small sized TLB while the hugepage TLB entry is still established in
+	 * the huge TLB. Some CPU doesn't like that.
+	 * See http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum
+	 * 383 on page 93. Intel should be safe but is also warns that it's
+	 * only safe if the permission and cache attributes of the two entries
+	 * loaded in the two TLB is identical (which should be the case here).
+	 * But it is generally safer to never allow small and huge TLB entries
+	 * for the same virtual address to be loaded simultaneously. So instead
+	 * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
+	 * current pmd notpresent (atomically because here the pmd_trans_huge
+	 * and pmd_trans_splitting must remain set at all times on the pmd
+	 * until the split is complete for this pmd), then we flush the SMP TLB
+	 * and finally we write the non-huge version of the pmd entry with
+	 * pmd_populate.
+	 */
+	pmdp_invalidate(vma, haddr, pmd);
 	pmd_populate(mm, pmd, pgtable);

 	if (freeze) {
--
Kirill A. Shutemov
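
The ordering change in the patch above can be pictured with a toy state model. Before the patch the pmd is cleared first and stays empty while the ptes are filled (per the removed comment); after it, the pmd stays huge until the ptes are ready, is then marked not-present but still huge, and only then is replaced by the page table pointer. The sketch below counts how many intermediate states an observer would misread as "nothing mapped". The observer that skips empty pmds is an assumption made for illustration, not something the thread confirms as the cause of the rss mismatch, and all toy_* names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy pmd states, not the kernel's encoding. */
enum toy_pmd {
	TOY_PMD_HUGE,		/* present, maps the huge page */
	TOY_PMD_NONE,		/* empty */
	TOY_PMD_INVALID_HUGE,	/* not present, but still marked huge */
	TOY_PMD_TABLE,		/* points to the filled pte table */
};

/* Assumed observer: treats an empty pmd as "nothing mapped here". */
static bool toy_observer_skips(enum toy_pmd pmd)
{
	return pmd == TOY_PMD_NONE;
}

/* Count the steps of a split at which the observer would skip the range. */
static int toy_skip_windows(bool clear_first)
{
	/* Before the patch: clear and flush, fill ptes, populate. */
	static const enum toy_pmd old_order[] = {
		TOY_PMD_HUGE, TOY_PMD_NONE, TOY_PMD_NONE, TOY_PMD_TABLE,
	};
	/* After the patch: fill ptes, invalidate, populate. */
	static const enum toy_pmd new_order[] = {
		TOY_PMD_HUGE, TOY_PMD_HUGE, TOY_PMD_INVALID_HUGE, TOY_PMD_TABLE,
	};
	const enum toy_pmd *steps = clear_first ? old_order : new_order;
	int windows = 0;

	for (size_t i = 0; i < 4; i++)
		if (toy_observer_skips(steps[i]))
			windows++;
	return windows;
}
```

In this model toy_skip_windows(true) reports two steps at which the pmd looks empty, while toy_skip_windows(false) reports none, since the pmd never passes through the empty state.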

2015-11-12 00:35:59

by Minchan Kim

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

On Mon, Nov 09, 2015 at 12:55:22AM +0200, Kirill A. Shutemov wrote:
> On Thu, Nov 05, 2015 at 09:19:22AM +0900, Minchan Kim wrote:
> > On Wed, Nov 04, 2015 at 04:21:35PM +0200, Kirill A. Shutemov wrote:
> > > On Wed, Nov 04, 2015 at 12:20:19AM +0900, Minchan Kim wrote:
> > > > On Tue, Nov 03, 2015 at 04:33:29PM +0900, Minchan Kim wrote:
> > > > > On Tue, Nov 03, 2015 at 09:16:50AM +0200, Kirill A. Shutemov wrote:
> > > > > > On Tue, Nov 03, 2015 at 12:02:58PM +0900, Minchan Kim wrote:
> > > > > > > Hello Kirill,
> > > > > > >
> > > > > > > On Mon, Nov 02, 2015 at 02:57:49PM +0200, Kirill A. Shutemov wrote:
> > > > > > > > On Fri, Oct 30, 2015 at 04:03:50PM +0900, Minchan Kim wrote:
> > > > > > > > > On Thu, Oct 29, 2015 at 11:52:06AM +0200, Kirill A. Shutemov wrote:
> > > > > > > > > > On Thu, Oct 29, 2015 at 04:58:29PM +0900, Minchan Kim wrote:
> > > > > > > > > > > On Thu, Oct 29, 2015 at 02:25:24AM +0200, Kirill A. Shutemov wrote:
> > > > > > > > > > > > On Thu, Oct 22, 2015 at 06:00:51PM +0900, Minchan Kim wrote:
> > > > > > > > > > > > > On Thu, Oct 22, 2015 at 10:21:36AM +0900, Minchan Kim wrote:
> > > > > > > > > > > > > > Hello Hugh,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Oct 21, 2015 at 05:59:59PM -0700, Hugh Dickins wrote:
> > > > > > > > > > > > > > > On Thu, 22 Oct 2015, Minchan Kim wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I added the code to check it and queued it again but I had another oops
> > > > > > > > > > > > > > > > in this time but symptom is related to anon_vma, too.
> > > > > > > > > > > > > > > > (kernel is based on recent mmotm + unconditional mkdirty for bug fix)
> > > > > > > > > > > > > > > > It seems page_get_anon_vma returns NULL since the page was not page_mapped
> > > > > > > > > > > > > > > > at that time but second check of page_mapped right before try_to_unmap seems
> > > > > > > > > > > > > > > > to be true.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > > > > > > > > > page:ffffea0001cfbfc0 count:3 mapcount:1 mapping:ffff88007f1b5f51 index:0x600000aff
> > > > > > > > > > > > > > > > flags: 0x4000000000048019(locked|uptodate|dirty|swapcache|swapbacked)
> > > > > > > > > > > > > > > > page dumped because: VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > That's interesting, that's one I added in my page migration series.
> > > > > > > > > > > > > > > Let me think on it, but it could well relate to the one you got before.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I will roll back to mm/madv_free-v4.3-rc5-mmotm-2015-10-15-15-20
> > > > > > > > > > > > > > instead of next-20151021 to remove noise from your migration cleanup
> > > > > > > > > > > > > > series and will test it again.
> > > > > > > > > > > > > > If it is fixed, I will test again with your migration patchset, then.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I tested mmotm-2015-10-15-15-20 with test program I attach for a long time.
> > > > > > > > > > > > > Therefore, there is no patchset from Hugh's migration patch in there.
> > > > > > > > > > > > > And I added below debug code with request from Kirill to all test kernels.
> > > > > > > > > > > >
> > > > > > > > > > > > It took too long time (and a lot of printk()), but I think I track it down
> > > > > > > > > > > > finally.
> > > > > > > > > > > >
> > > > > > > > > > > > The patch below seems fixes issue for me. It's not yet properly tested, but
> > > > > > > > > > > > looks like it works.
> > > > > > > > > > > >
> > > > > > > > > > > > The problem was my wrong assumption on how migration works: I thought that
> > > > > > > > > > > > kernel would wait migration to finish on before deconstruction mapping.
> > > > > > > > > > > >
> > > > > > > > > > > > But turn out that's not true.
> > > > > > > > > > > >
> > > > > > > > > > > > As result if zap_pte_range() races with split_huge_page(), we can end up
> > > > > > > > > > > > with page which is not mapped anymore but has _count and _mapcount
> > > > > > > > > > > > elevated. The page is on LRU too. So it's still reachable by vmscan and by
> > > > > > > > > > > > pfn scanners (Sasha showed few similar traces from compaction too).
> > > > > > > > > > > > It's likely that page->mapping in this case would point to freed anon_vma.
> > > > > > > > > > > >
> > > > > > > > > > > > BOOM!
> > > > > > > > > > > >
> > > > > > > > > > > > The patch modify freeze/unfreeze_page() code to match normal migration
> > > > > > > > > > > > entries logic: on setup we remove page from rmap and drop pin, on removing
> > > > > > > > > > > > we get pin back and put page on rmap. This way even if migration entry
> > > > > > > > > > > > will be removed under us we don't corrupt page's state.
> > > > > > > > > > > >
> > > > > > > > > > > > Please, test.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > kernel: On mmotm-2015-10-15-15-20 + pte_mkdirty patch + your new patch, I tested
> > > > > > > > > > > one I sent to you(ie, oops.c + memcg_test.sh)
> > > > > > > > > > >
> > > > > > > > > > > page:ffffea00016a0000 count:3 mapcount:0 mapping:ffff88007f49d001 index:0x600001800 compound_mapcount: 0
> > > > > > > > > > > flags: 0x4000000000044009(locked|uptodate|head|swapbacked)
> > > > > > > > > > > page dumped because: VM_BUG_ON_PAGE(!page_mapcount(page))
> > > > > > > > > > > page->mem_cgroup:ffff88007f613c00
> > > > > > > > > >
> > > > > > > > > > Ignore my previous answer. Still sleeping.
> > > > > > > > > >
> > > > > > > > > > The right way to fix I think is something like:
> > > > > > > > > >
> > > > > > > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > > > > > > > index 35643176bc15..f2d46792a554 100644
> > > > > > > > > > --- a/mm/rmap.c
> > > > > > > > > > +++ b/mm/rmap.c
> > > > > > > > > > @@ -1173,20 +1173,12 @@ void do_page_add_anon_rmap(struct page *page,
> > > > > > > > > > bool compound = flags & RMAP_COMPOUND;
> > > > > > > > > > bool first;
> > > > > > > > > >
> > > > > > > > > > - if (PageTransCompound(page)) {
> > > > > > > > > > + if (PageTransCompound(page) && compound) {
> > > > > > > > > > + atomic_t *mapcount;
> > > > > > > > > > VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > > > > > > > > - if (compound) {
> > > > > > > > > > - atomic_t *mapcount;
> > > > > > > > > > -
> > > > > > > > > > - VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> > > > > > > > > > - mapcount = compound_mapcount_ptr(page);
> > > > > > > > > > - first = atomic_inc_and_test(mapcount);
> > > > > > > > > > - } else {
> > > > > > > > > > - /* Anon THP always mapped first with PMD */
> > > > > > > > > > - first = 0;
> > > > > > > > > > - VM_BUG_ON_PAGE(!page_mapcount(page), page);
> > > > > > > > > > - atomic_inc(&page->_mapcount);
> > > > > > > > > > - }
> > > > > > > > > > + VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> > > > > > > > > > + mapcount = compound_mapcount_ptr(page);
> > > > > > > > > > + first = atomic_inc_and_test(mapcount);
> > > > > > > > > > } else {
> > > > > > > > > > VM_BUG_ON_PAGE(compound, page);
> > > > > > > > > > first = atomic_inc_and_test(&page->_mapcount);
> > > > > > > > > > --
> > > > > > > > >
> > > > > > > > > kernel: On mmotm-2015-10-15-15-20 + pte_mkdirty patch + freeze/unfreeze patch + above patch,
> > > > > > > > >
> > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > > BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512
> > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > >
> > > > > > > > > <SNIP>
> > > > > > > > >
> > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > > > > > > BUG: Bad rss-counter state mm:ffff880046980700 idx:1 val:511
> > > > > > > > > BUG: Bad rss-counter state mm:ffff880046980700 idx:2 val:1
> > > > > > > >
> > > > > > > > Hm. I was not able to trigger this and don't see anything obvious that could
> > > > > > > > lead to this kind of mismatch :-/
> > > > > >
> > > > > > I managed to trigger this when switched back from MADV_DONTNEED to
> > > > > > MADV_FREE. Hm..
> > > > >
> > > > > Hmm,,
> > > > > What version of MADV_FREE do you test on?
> > > > > Old MADV_FREE(ie, before posting MADV_FREE refactoring and fix KSM page)
> > > > > had a bug.
> > > > >
> > > > > I tried your patches on top of recent my MADV_FREE patches.
> > > > > But when I try it with old THP refcount redesign, I couldn't find
> > > > > any problem so far. However, I'm not saying it's your fault.
> > > > >
> > > > > I will give it a shot with MADV_DONTNEED to reproduce the problem.
> > > > > But one thing I could say is MADV_DONTNEED is more hard to hit
> > > > > compared to MADV_FREE because memory pressure of MADV_DONTNEED test
> > > > > wouldn't be heavy.
> > > >
> > > > I reproduced this on the kernel which has no code related to MADV_FREE:
> > > >
> > > > mmotm-2015-10-15-15-20-no-madvise_free, IOW it means git head for
> > > > 54bad5da4834 arm64: add pmd_[dirty|mkclean] for THP so there is no
> > > > MADV_FREE code in there
> > > > + pte_mkdirty patch
> > > > + freeze/unfreeze patch
> > > > + do_page_add_anon_rmap patch
> > > >
> > > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > > BUG: Bad rss-counter state mm:ffff88007fdd5b00 idx:1 val:511
> > > > BUG: Bad rss-counter state mm:ffff88007fdd5b00 idx:2 val:1
> > >
> > > I have one idea why it could happen, but not sure yet..
> > >
> > > Could you check if it makes any difference for you?
> > >
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > index 5c7b00e88236..194f7f8b8c66 100644
> > > --- a/include/linux/huge_mm.h
> > > +++ b/include/linux/huge_mm.h
> > > @@ -103,12 +103,7 @@ void deferred_split_huge_page(struct page *page);
> > > void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > > unsigned long address);
> > >
> > > -#define split_huge_pmd(__vma, __pmd, __address) \
> > > - do { \
> > > - pmd_t *____pmd = (__pmd); \
> > > - if (pmd_trans_huge(*____pmd)) \
> > > - __split_huge_pmd(__vma, __pmd, __address); \
> > > - } while (0)
> > > +#define split_huge_pmd(__vma, __pmd, __address) __split_huge_pmd(__vma, __pmd, __address)
> >
> > mmotm-2015-10-15-15-20-no-madvise_free, IOW it means git head for
> > 54bad5da4834 arm64: add pmd_[dirty|mkclean] for THP so there is no
> > MADV_FREE code in there
> > + pte_mkdirty patch
> > + freeze/unfreeze patch
> > + do_page_add_anon_rmap patch
> > + above split_huge_pmd
> >
> >
> > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > BUG: Bad rss-counter state mm:ffff88007fa3bb80 idx:1 val:512
>
> With the patch below my test setup run for 2+ days without triggering the
> bug. split_huge_pmd patch should be dropped.
>
> Please test.
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 14cbbad54a3e..7aa0a3fef2aa 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2841,9 +2841,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> write = pmd_write(*pmd);
> young = pmd_young(*pmd);
>
> - /* leave pmd empty until pte is filled */
> - pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> -
> pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> pmd_populate(mm, &_pmd, pgtable);
>
> @@ -2893,6 +2890,28 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> }
>
> smp_wmb(); /* make pte visible before pmd */
> + /*
> + * Up to this point the pmd is present and huge and userland has the
> + * whole access to the hugepage during the split (which happens in
> + * place). If we overwrite the pmd with the not-huge version pointing
> + * to the pte here (which of course we could if all CPUs were bug
> + * free), userland could trigger a small page size TLB miss on the
> + * small sized TLB while the hugepage TLB entry is still established in
> + * the huge TLB. Some CPU doesn't like that.
> + * See http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum
> + * 383 on page 93. Intel should be safe but is also warns that it's
> + * only safe if the permission and cache attributes of the two entries
> + * loaded in the two TLB is identical (which should be the case here).
> + * But it is generally safer to never allow small and huge TLB entries
> + * for the same virtual address to be loaded simultaneously. So instead
> + * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
> + * current pmd notpresent (atomically because here the pmd_trans_huge
> + * and pmd_trans_splitting must remain set at all times on the pmd
> + * until the split is complete for this pmd), then we flush the SMP TLB
> + * and finally we write the non-huge version of the pmd entry with
> + * pmd_populate.
> + */
> + pmdp_invalidate(vma, haddr, pmd);
> pmd_populate(mm, pmd, pgtable);
>
> if (freeze) {

I have been testing this patch with MADV_DONTNEED for a few days and
I can no longer see the problem. I will continue testing it
with MADV_FREE.

Thanks.
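
The ordering argument in the patch comment above (mark the pmd not present,
flush the TLB, and only then populate the pmd) can be illustrated with a toy
userspace model. This is only a sketch under stated assumptions: every name
prefixed with model_ or MODEL_ is hypothetical and does not exist in the
kernel, and a single "walk" stands in for whatever TLB fill might race with
the split. It shows only why the order of the steps matters, not how
pmdp_invalidate() is implemented.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types; none of these names exist in the kernel. */
enum model_pmd { MODEL_PMD_HUGE, MODEL_PMD_NOT_PRESENT, MODEL_PMD_TABLE };

struct model_cpu {
	enum model_pmd pmd;   /* what the page table currently says */
	bool huge_tlb;        /* a huge-page translation is cached */
	bool small_tlb;       /* a small-page translation is cached */
	bool conflict_seen;   /* both kinds were cached at the same time */
};

/* A hardware page-table walk: caches whatever the pmd describes right now. */
void model_walk(struct model_cpu *c)
{
	if (c->pmd == MODEL_PMD_HUGE)
		c->huge_tlb = true;
	else if (c->pmd == MODEL_PMD_TABLE)
		c->small_tlb = true;
	/* MODEL_PMD_NOT_PRESENT: the walk would fault, nothing is cached. */
	if (c->huge_tlb && c->small_tlb)
		c->conflict_seen = true;
}

void model_flush(struct model_cpu *c)
{
	c->huge_tlb = false;
	c->small_tlb = false;
}

/* The order the comment warns against: populate first, flush afterwards. */
void model_split_populate_then_flush(struct model_cpu *c)
{
	assert(c->pmd == MODEL_PMD_HUGE);
	c->pmd = MODEL_PMD_TABLE;
	model_walk(c);          /* a walk may race in before the flush */
	model_flush(c);
}

/* The order the patch uses: mark not present, flush, then populate. */
void model_split_invalidate_first(struct model_cpu *c)
{
	assert(c->pmd == MODEL_PMD_HUGE);
	c->pmd = MODEL_PMD_NOT_PRESENT;
	model_walk(c);          /* a racing walk finds nothing to cache */
	model_flush(c);
	c->pmd = MODEL_PMD_TABLE;
	model_walk(c);
}
```

In this model, a CPU that already holds the huge translation ends up with both
entry sizes cached when the pmd is populated before the flush. With the
invalidate-first order the two are never cached together.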

2015-11-16 01:44:45

by Minchan Kim

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

On Thu, Nov 12, 2015 at 09:36:14AM +0900, Minchan Kim wrote:

<snip>

> > > mmotm-2015-10-15-15-20-no-madvise_free, IOW it means git head for
> > > 54bad5da4834 arm64: add pmd_[dirty|mkclean] for THP so there is no
> > > MADV_FREE code in there
> > > + pte_mkdirty patch
> > > + freeze/unfreeze patch
> > > + do_page_add_anon_rmap patch
> > > + above split_huge_pmd
> > >
> > >
> > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > Adding 4191228k swap on /dev/vda5. Priority:-1 extents:1 across:4191228k FS
> > > BUG: Bad rss-counter state mm:ffff88007fa3bb80 idx:1 val:512
> >
> > With the patch below my test setup ran for 2+ days without triggering the
> > bug. split_huge_pmd patch should be dropped.
> >
> > Please test.
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 14cbbad54a3e..7aa0a3fef2aa 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2841,9 +2841,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > write = pmd_write(*pmd);
> > young = pmd_young(*pmd);
> >
> > - /* leave pmd empty until pte is filled */
> > - pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> > -
> > pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> > pmd_populate(mm, &_pmd, pgtable);
> >
> > @@ -2893,6 +2890,28 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > }
> >
> > smp_wmb(); /* make pte visible before pmd */
> > + /*
> > + * Up to this point the pmd is present and huge and userland has the
> > + * whole access to the hugepage during the split (which happens in
> > + * place). If we overwrite the pmd with the not-huge version pointing
> > + * to the pte here (which of course we could if all CPUs were bug
> > + * free), userland could trigger a small page size TLB miss on the
> > + * small sized TLB while the hugepage TLB entry is still established in
> > + * the huge TLB. Some CPU doesn't like that.
> > + * See http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum
> > + * 383 on page 93. Intel should be safe but is also warns that it's
> > + * only safe if the permission and cache attributes of the two entries
> > + * loaded in the two TLB is identical (which should be the case here).
> > + * But it is generally safer to never allow small and huge TLB entries
> > + * for the same virtual address to be loaded simultaneously. So instead
> > + * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
> > + * current pmd notpresent (atomically because here the pmd_trans_huge
> > + * and pmd_trans_splitting must remain set at all times on the pmd
> > + * until the split is complete for this pmd), then we flush the SMP TLB
> > + * and finally we write the non-huge version of the pmd entry with
> > + * pmd_populate.
> > + */
> > + pmdp_invalidate(vma, haddr, pmd);
> > pmd_populate(mm, pmd, pgtable);
> >
> > if (freeze) {
>
> I have been testing this patch with MADV_DONTNEED for a few days and
> I can no longer see the problem. I will continue testing it
> with MADV_FREE.

While testing MADV_FREE on a kernel with your patches applied,
I couldn't see any problem.

However, in this round I ran another test. It is the same one
I attached, but a little bit different: it doesn't do the
memcg/kill/swapoff steps, so the test program runs as a long-lived test.

With that, I encountered this problem.

page:ffffea0000f60080 count:1 mapcount:0 mapping:ffff88007f584691 index:0x600002a02
flags: 0x400000000006a028(uptodate|lru|writeback|swapcache|reclaim|swapbacked)
page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
page->mem_cgroup:ffff880077cf0c00
------------[ cut here ]------------
kernel BUG at mm/huge_memory.c:3340!
invalid opcode: 0000 [#1] SMP
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 7 PID: 1657 Comm: memhog Not tainted 4.3.0-rc5-mm1-madv-free+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88006b0f1a40 ti: ffff88004ced4000 task.ti: ffff88004ced4000
RIP: 0010:[<ffffffff8114bf67>] [<ffffffff8114bf67>] split_huge_page_to_list+0x907/0x920
RSP: 0018:ffff88004ced7a38 EFLAGS: 00010296
RAX: 0000000000000021 RBX: ffffea0000f60080 RCX: ffffffff81830db8
RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffffff821df4d8
RBP: ffff88004ced7ab8 R08: 0000000000000000 R09: ffff8800000bc560
R10: ffffffff8163d880 R11: 0000000000014f25 R12: ffffea0000f60080
R13: ffffea0000f60088 R14: ffffea0000f60080 R15: 0000000000000000
FS: 00007f43d3ced740(0000) GS:ffff8800782e0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff1f6fcdb98 CR3: 000000004cf56000 CR4: 00000000000006a0
Stack:
cccccccccccccccd ffffea0000f60080 ffff88004ced7ad0 ffffea0000f60088
ffff88004ced7ad0 0000000000000000 ffff88004ced7ab8 ffffffff810ef9d0
ffffea0000f60000 0000000000000000 0000000000000000 ffffea0000f60080
Call Trace:
[<ffffffff810ef9d0>] ? __lock_page+0xa0/0xb0
[<ffffffff8114c09c>] deferred_split_scan+0x11c/0x260
[<ffffffff81117bfc>] ? list_lru_count_one+0x1c/0x30
[<ffffffff81101333>] shrink_slab.part.42+0x1e3/0x350
[<ffffffff81105daa>] shrink_zone+0x26a/0x280
[<ffffffff81105eed>] do_try_to_free_pages+0x12d/0x3b0
[<ffffffff81106224>] try_to_free_pages+0xb4/0x140
[<ffffffff810f8a59>] __alloc_pages_nodemask+0x459/0x920
[<ffffffff8111e667>] handle_mm_fault+0xc77/0x1000
[<ffffffff8142718d>] ? retint_kernel+0x10/0x10
[<ffffffff81033629>] __do_page_fault+0x189/0x400
[<ffffffff810338ac>] do_page_fault+0xc/0x10
[<ffffffff81428142>] page_fault+0x22/0x30
Code: ff ff 48 c7 c6 f0 b2 77 81 4c 89 f7 e8 13 c3 fc ff 0f 0b 48 83 e8 01 e9 88 f7 ff ff 48 c7 c6 70 a1 77 81 4c 89 f7 e8 f9 c2 fc ff <0f> 0b 48 c7 c6 38 af 77 81 4c 89 e7 e8 e8 c2 fc ff 0f 0b 66 0f
RIP [<ffffffff8114bf67>] split_huge_page_to_list+0x907/0x920
RSP <ffff88004ced7a38>
---[ end trace c9a60522e3a296e4 ]---


So I reverted all the MADV_FREE patches and changed the test to use MADV_DONTNEED.
This time I saw the oops below.
If I am missing something, please let me know.

------------[ cut here ]------------
kernel BUG at include/linux/swapops.h:129!
invalid opcode: 0000 [#1] SMP
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 5 PID: 1563 Comm: madvise_test Not tainted 4.3.0-rc5-mm1-no-madv-free+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88007e8d3480 ti: ffff88007f524000 task.ti: ffff88007f524000
RIP: 0010:[<ffffffff811504be>] [<ffffffff811504be>] migration_entry_to_page.part.61+0x4/0x6
RSP: 0018:ffff88007f527cd0 EFLAGS: 00010246
RAX: ffffea0000896b00 RBX: 00006000013ac000 RCX: ffffea0000000000
RDX: 0000000000000000 RSI: ffffea0001f93e80 RDI: 3e000000000225ac
RBP: ffff88007f527cd0 R08: 0000000000000101 R09: ffff88007e4fa000
R10: ffffea0001fda740 R11: 0000000000000000 R12: 00000000044b583e
R13: 00006000013ad000 R14: ffff88007f527e00 R15: ffff88007e4fad60
FS: 00007fe2f099a740(0000) GS:ffff8800782a0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000166c0d0 CR3: 000000007e57b000 CR4: 00000000000006a0
Stack:
ffff88007f527db8 ffffffff81118030 00006000017fffff ffff88007f527e00
00006000017fffff ffff88007ed71000 ffff88007e57b600 0000600001800000
0000600001800000 00006000017fffff 0000600001800000 ffff88007efb6b78
Call Trace:
[<ffffffff81118030>] unmap_single_vma+0x840/0x880
[<ffffffff811188a1>] unmap_vmas+0x41/0x60
[<ffffffff8111dfad>] unmap_region+0x9d/0x100
[<ffffffff81120007>] do_munmap+0x217/0x380
[<ffffffff811201b1>] vm_munmap+0x41/0x60
[<ffffffff811210d2>] SyS_munmap+0x22/0x30
[<ffffffff81420357>] entry_SYSCALL_64_fastpath+0x12/0x6a
Code: df 48 c1 ff 06 49 01 fc 4c 89 e7 e8 9c ff ff ff 85 c0 74 0c 4c 89 e0 48 c1 e0 06 48 29 d8 eb 02 31 c0 5b 41 5c 5d c3 55 48 89 e5 <0f> 0b 55 48 c7 c6 30 80 77 81 48 89 e5 e8 f0 45 fc ff 0f 0b 55
RIP [<ffffffff811504be>] migration_entry_to_page.part.61+0x4/0x6
RSP <ffff88007f527cd0>
---[ end trace 01097fb7f9cf1b6c ]---

Another hit:

page:ffffea0000520080 count:2 mapcount:0 mapping:ffff880072b38a51 index:0x600002602
flags: 0x4000000000048028(uptodate|lru|swapcache|swapbacked)
page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
page->mem_cgroup:ffff880077cf0c00
------------[ cut here ]------------
kernel BUG at mm/huge_memory.c:3306!
invalid opcode: 0000 [#1] SMP
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 6 PID: 1419 Comm: madvise_test Not tainted 4.3.0-rc5-mm1-no-madv-free+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88006f108000 ti: ffff88006f054000 task.ti: ffff88006f054000
RIP: 0010:[<ffffffff811473bf>] [<ffffffff811473bf>] split_huge_page_to_list+0x81f/0x890
RSP: 0000:ffff88006f057a40 EFLAGS: 00010282
RAX: 0000000000000021 RBX: ffffea0000520080 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffffff821dd418
RBP: ffff88006f057ab8 R08: 0000000000000000 R09: ffff8800000bfb20
R10: ffffffff8163d1c0 R11: 0000000000005c5f R12: ffff88006f057ad0
R13: ffffea0000520080 R14: ffffea0000520080 R15: 0000000000000000
FS: 00007f09963a2740(0000) GS:ffff8800782c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000600003d92000 CR3: 000000007372e000 CR4: 00000000000006a0
Stack:
ffffea0000520080 ffff88006f057ad0 ffffea0000520088 ffff88006f057ad0
0000000000000000 ffff88006f057ab8 ffffffff810ec700 ffffea0000520000
0000000000000000 0000000000000000 ffffea0000520080 ffff88006f057ad0
Call Trace:
[<ffffffff810ec700>] ? __lock_page+0xa0/0xb0
[<ffffffff81147545>] deferred_split_scan+0x115/0x240
[<ffffffff8111445c>] ? list_lru_count_one+0x1c/0x30
[<ffffffff810fdd63>] shrink_slab.part.43+0x1e3/0x350
[<ffffffff81102788>] shrink_zone+0x238/0x250
[<ffffffff811028cd>] do_try_to_free_pages+0x12d/0x3b0
[<ffffffff81102c04>] try_to_free_pages+0xb4/0x140
[<ffffffff810f57b9>] __alloc_pages_nodemask+0x459/0x920
[<ffffffff8111aa2a>] handle_mm_fault+0xbca/0xf90
[<ffffffff8105b8bc>] ? enqueue_task+0x3c/0x60
[<ffffffff810602eb>] ? __set_cpus_allowed_ptr+0x9b/0x1a0
[<ffffffff81032b49>] __do_page_fault+0x189/0x400
[<ffffffff81032dcc>] do_page_fault+0xc/0x10
[<ffffffff81421e02>] page_fault+0x22/0x30
Code: ff ff 48 c7 c6 d0 91 77 81 4c 89 f7 e8 1b d7 fc ff 0f 0b 48 83 e8 01 e9 70 f8 ff ff 48 c7 c6 50 80 77 81 4c 89 f7 e8 01 d7 fc ff <0f> 0b 48 c7 c6 d8 be 77 81 4c 89 ef e8 f0 d6 fc ff 0f 0b 48 83
RIP [<ffffffff811473bf>] split_huge_page_to_list+0x81f/0x890
RSP <ffff88006f057a40>
---[ end trace 0ce8751b8410cd8e ]---

2015-11-16 08:45:28

by Kirill A. Shutemov

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

On Mon, Nov 16, 2015 at 10:45:21AM +0900, Minchan Kim wrote:
> While testing MADV_FREE on a kernel with your patches applied,
> I couldn't see any problem.
>
> However, in this round I ran another test. It is the same one
> I attached, but a little bit different: it doesn't do the
> memcg/kill/swapoff steps, so the test program runs as a long-lived test.

Could you share updated test?

And could you try to reproduce it on clean mmotm-2015-11-10-15-53?

> With that, I encountered this problem.
>
> page:ffffea0000f60080 count:1 mapcount:0 mapping:ffff88007f584691 index:0x600002a02
> flags: 0x400000000006a028(uptodate|lru|writeback|swapcache|reclaim|swapbacked)
> page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
> page->mem_cgroup:ffff880077cf0c00
> ------------[ cut here ]------------
> kernel BUG at mm/huge_memory.c:3340!
> invalid opcode: 0000 [#1] SMP
> Dumping ftrace buffer:
> (ftrace buffer empty)
> Modules linked in:
> CPU: 7 PID: 1657 Comm: memhog Not tainted 4.3.0-rc5-mm1-madv-free+ #4
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> task: ffff88006b0f1a40 ti: ffff88004ced4000 task.ti: ffff88004ced4000
> RIP: 0010:[<ffffffff8114bf67>] [<ffffffff8114bf67>] split_huge_page_to_list+0x907/0x920
> RSP: 0018:ffff88004ced7a38 EFLAGS: 00010296
> RAX: 0000000000000021 RBX: ffffea0000f60080 RCX: ffffffff81830db8
> RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffffff821df4d8
> RBP: ffff88004ced7ab8 R08: 0000000000000000 R09: ffff8800000bc560
> R10: ffffffff8163d880 R11: 0000000000014f25 R12: ffffea0000f60080
> R13: ffffea0000f60088 R14: ffffea0000f60080 R15: 0000000000000000
> FS: 00007f43d3ced740(0000) GS:ffff8800782e0000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007ff1f6fcdb98 CR3: 000000004cf56000 CR4: 00000000000006a0
> Stack:
> cccccccccccccccd ffffea0000f60080 ffff88004ced7ad0 ffffea0000f60088
> ffff88004ced7ad0 0000000000000000 ffff88004ced7ab8 ffffffff810ef9d0
> ffffea0000f60000 0000000000000000 0000000000000000 ffffea0000f60080
> Call Trace:
> [<ffffffff810ef9d0>] ? __lock_page+0xa0/0xb0
> [<ffffffff8114c09c>] deferred_split_scan+0x11c/0x260
> [<ffffffff81117bfc>] ? list_lru_count_one+0x1c/0x30
> [<ffffffff81101333>] shrink_slab.part.42+0x1e3/0x350
> [<ffffffff81105daa>] shrink_zone+0x26a/0x280
> [<ffffffff81105eed>] do_try_to_free_pages+0x12d/0x3b0
> [<ffffffff81106224>] try_to_free_pages+0xb4/0x140
> [<ffffffff810f8a59>] __alloc_pages_nodemask+0x459/0x920
> [<ffffffff8111e667>] handle_mm_fault+0xc77/0x1000
> [<ffffffff8142718d>] ? retint_kernel+0x10/0x10
> [<ffffffff81033629>] __do_page_fault+0x189/0x400
> [<ffffffff810338ac>] do_page_fault+0xc/0x10
> [<ffffffff81428142>] page_fault+0x22/0x30
> Code: ff ff 48 c7 c6 f0 b2 77 81 4c 89 f7 e8 13 c3 fc ff 0f 0b 48 83 e8 01 e9 88 f7 ff ff 48 c7 c6 70 a1 77 81 4c 89 f7 e8 f9 c2 fc ff <0f> 0b 48 c7 c6 38 af 77 81 4c 89 e7 e8 e8 c2 fc ff 0f 0b 66 0f
> RIP [<ffffffff8114bf67>] split_huge_page_to_list+0x907/0x920
> RSP <ffff88004ced7a38>
> ---[ end trace c9a60522e3a296e4 ]---

I don't see how this is possible: we call lock_page() just before
split_huge_page() in deferred_split_scan().

> So I reverted all the MADV_FREE patches and changed the test to use MADV_DONTNEED.
> This time I saw the oops below.
> If I am missing something, please let me know.
>
> ------------[ cut here ]------------
> kernel BUG at include/linux/swapops.h:129!

Looks similar to what I fixed by inserting smp_wmb() just before
clear_compound_head() in __split_huge_page_tail().

Do you have this in place, as in the latest -mm tree?

> Another hit:
>
> page:ffffea0000520080 count:2 mapcount:0 mapping:ffff880072b38a51 index:0x600002602
> flags: 0x4000000000048028(uptodate|lru|swapcache|swapbacked)
> page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
> page->mem_cgroup:ffff880077cf0c00
> ------------[ cut here ]------------
> kernel BUG at mm/huge_memory.c:3306!

The same as the first one: no idea.

--
Kirill A. Shutemov

2015-11-16 10:32:24

by Minchan Kim

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

On Mon, Nov 16, 2015 at 10:45:22AM +0200, Kirill A. Shutemov wrote:
> On Mon, Nov 16, 2015 at 10:45:21AM +0900, Minchan Kim wrote:
> > While testing MADV_FREE on a kernel with your patches applied,
> > I couldn't see any problem.
> >
> > However, in this round I ran another test. It is the same one
> > I attached, but a little bit different: it doesn't do the
> > memcg/kill/swapoff steps, so the test program runs as a long-lived test.
>
> Could you share updated test?

It's part of my test suite, so I need to factor it out first.
I will send it when I get to the office tomorrow.

>
> And could you try to reproduce it on clean mmotm-2015-11-10-15-53?

Before leaving the office I queued it up, and the result is below.
It seems you have already fixed this but the fix is not in mmotm yet. Right?
Anyway, please confirm and tell me which additional patches I should apply
on top of mmotm-2015-11-10-15-53 to pick up your many recent bug
fixes.

Thanks.

page:ffffea0000553fc0 count:3 mapcount:1 mapping:ffff88007f717a01 index:0x6000002ff
flags: 0x4000000000048019(locked|uptodate|dirty|swapcache|swapbacked)
page dumped because: VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma)
page->mem_cgroup:ffff880077cf0c00
------------[ cut here ]------------
kernel BUG at mm/migrate.c:889!
invalid opcode: 0000 [#1] SMP
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 10 PID: 59 Comm: khugepaged Not tainted 4.3.0-mm1-kirill+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff880073441a40 ti: ffff88007344c000 task.ti: ffff88007344c000
RIP: 0010:[<ffffffff81145466>] [<ffffffff81145466>] migrate_pages+0x8e6/0x950
RSP: 0018:ffff88007344fa00 EFLAGS: 00010282
RAX: 0000000000000021 RBX: ffffea0001a0bbc0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffffff821df4d8
RBP: ffff88007344fa80 R08: 0000000000000000 R09: ffff8800000b9540
R10: ffffffff8163e2c0 R11: 00000000000002c2 R12: 0000000000000000
R13: ffffea0000553f80 R14: ffffea0000553fc0 R15: ffffffff8189db40
FS: 0000000000000000(0000) GS:ffff880078340000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f45cc0091d8 CR3: 000000007eba7000 CR4: 00000000000006a0
Stack:
ffff880073441a40 0000000000000000 0000000000000000 0000000000000000
ffffffff81114880 0000000000000000 ffffffff81116420 ffffea0000553fe0
ffff88007344fb30 ffff88007344fb20 0000000000000000 ffff88007344fb20
Call Trace:
[<ffffffff81114880>] ? trace_raw_output_mm_compaction_defer_template+0xc0/0xc0
[<ffffffff81116420>] ? isolate_freepages_block+0x3d0/0x3d0
[<ffffffff81116dfb>] compact_zone+0x2bb/0x720
[<ffffffff8128793d>] ? list_del+0xd/0x30
[<ffffffff811172cd>] compact_zone_order+0x6d/0xa0
[<ffffffff8111751d>] try_to_compact_pages+0xed/0x200
[<ffffffff81154143>] __alloc_pages_direct_compact+0x3b/0xd4
[<ffffffff810f921b>] __alloc_pages_nodemask+0x3fb/0x920
[<ffffffff81147465>] khugepaged+0x155/0x1b10
[<ffffffff81073ca0>] ? prepare_to_wait_event+0xf0/0xf0
[<ffffffff81147310>] ? __split_huge_pmd_locked+0x4e0/0x4e0
[<ffffffff81057e49>] kthread+0xc9/0xe0
[<ffffffff81057d80>] ? kthread_park+0x60/0x60
[<ffffffff8142aa6f>] ret_from_fork+0x3f/0x70
[<ffffffff81057d80>] ? kthread_park+0x60/0x60
Code: 44 c6 48 8b 40 08 83 e0 03 48 83 f8 03 0f 84 fd fa ff ff 4d 85 e4 0f 85 f4 fa ff ff 48 c7 c6 b8 f6 77 81 4c 89 f7 e8 fa 36 fd ff <0f> 0b 48 83 e8 01 e9 d0 fa ff ff f6 40 07 01 0f 84 5b fd ff ff
RIP [<ffffffff81145466>] migrate_pages+0x8e6/0x950
RSP <ffff88007344fa00>
---[ end trace 337555313b7e45be ]---
Kernel panic - not syncing: Fatal exception
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled

2015-11-16 10:54:57

by Kirill A. Shutemov

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

On Mon, Nov 16, 2015 at 07:32:20PM +0900, Minchan Kim wrote:
> On Mon, Nov 16, 2015 at 10:45:22AM +0200, Kirill A. Shutemov wrote:
> > On Mon, Nov 16, 2015 at 10:45:21AM +0900, Minchan Kim wrote:
> > > While testing MADV_FREE on a kernel with your patches applied,
> > > I couldn't see any problem.
> > >
> > > However, in this round I ran another test. It is the same one
> > > I attached, but a little bit different: it doesn't do the
> > > memcg/kill/swapoff steps, so the test program runs as a long-lived test.
> >
> > Could you share updated test?
>
> It's part of my test suite, so I need to factor it out first.
> I will send it when I get to the office tomorrow.

Thanks.

> > And could you try to reproduce it on clean mmotm-2015-11-10-15-53?
>
> Before leaving the office I queued it up, and the result is below.
> It seems you have already fixed this but the fix is not in mmotm yet. Right?
> Anyway, please confirm and tell me which additional patches I should apply
> on top of mmotm-2015-11-10-15-53 to pick up your many recent bug
> fixes.

These two patches of mine are not in the mmotm-2015-11-10-15-53 release:

http://lkml.kernel.org/g/1447236557-68682-1-git-send-email-kirill.shutemov@linux.intel.com
http://lkml.kernel.org/g/1447236567-68751-1-git-send-email-kirill.shutemov@linux.intel.com

--
Kirill A. Shutemov

2015-11-17 07:35:42

by Minchan Kim

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

On Mon, Nov 16, 2015 at 12:54:53PM +0200, Kirill A. Shutemov wrote:
> On Mon, Nov 16, 2015 at 07:32:20PM +0900, Minchan Kim wrote:
> > On Mon, Nov 16, 2015 at 10:45:22AM +0200, Kirill A. Shutemov wrote:
> > > On Mon, Nov 16, 2015 at 10:45:21AM +0900, Minchan Kim wrote:
> > > > While testing MADV_FREE on a kernel with your patches applied,
> > > > I couldn't see any problem.
> > > >
> > > > However, in this round I ran another test. It is the same one
> > > > I attached, but a little bit different: it doesn't do the
> > > > memcg/kill/swapoff steps, so the test program runs as a long-lived test.
> > >
> > > Could you share updated test?
> >
> > It's part of my test suite, so I need to factor it out first.
> > I will send it when I get to the office tomorrow.
>
> Thanks.
>
> > > And could you try to reproduce it on clean mmotm-2015-11-10-15-53?
> >
> > Before leaving the office I queued it up, and the result is below.
> > It seems you have already fixed this but the fix is not in mmotm yet. Right?
> > Anyway, please confirm and tell me which additional patches I should apply
> > on top of mmotm-2015-11-10-15-53 to pick up your many recent bug
> > fixes.
>
> These two patches of mine are not in the mmotm-2015-11-10-15-53 release:
>
> http://lkml.kernel.org/g/1447236557-68682-1-git-send-email-kirill.shutemov@linux.intel.com
> http://lkml.kernel.org/g/1447236567-68751-1-git-send-email-kirill.shutemov@linux.intel.com

1. mm: fix __page_mapcount()
2. thp: fix leak due split_huge_page() vs. exit race

If I missed any patches, please let me know.

I applied the two patches above on top of mmotm-2015-11-10-15-53 and tested again.
Unfortunately, the result is below.

I am now preparing a test program I can send you, but it is not easy:
small changes made while factoring it out of the test suite seem to change
something (e.g., timing) and make the bug hard to reproduce. I will keep trying.


page:ffffea0000240080 count:2 mapcount:1 mapping:ffff88007eff3321 index:0x600000e02
flags: 0x4000000000040018(uptodate|dirty|swapbacked)
page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
page->mem_cgroup:ffff880077cf0c00
------------[ cut here ]------------
kernel BUG at mm/huge_memory.c:3272!
invalid opcode: 0000 [#1] SMP
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 8 PID: 59 Comm: khugepaged Not tainted 4.3.0-mm1-kirill+ #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff880073441a40 ti: ffff88007344c000 task.ti: ffff88007344c000
RIP: 0010:[<ffffffff8114bc9b>] [<ffffffff8114bc9b>] split_huge_page_to_list+0x8fb/0x910
RSP: 0018:ffff88007344f968 EFLAGS: 00010286
RAX: 0000000000000021 RBX: ffffea0000240080 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffffff821df4d8
RBP: ffff88007344f9e8 R08: 0000000000000000 R09: ffff8800000bc600
R10: ffffffff8163e2c0 R11: 0000000000004b47 R12: ffffea0000240080
R13: ffffea0000240088 R14: ffffea0000240080 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff880078300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007ffd59edcd68 CR3: 0000000001808000 CR4: 00000000000006a0
Stack:
cccccccccccccccd ffffea0000240080 ffff88007344fa00 ffffea0000240088
ffff88007344fa00 0000000000000000 ffff88007344f9e8 ffffffff810f0200
ffffea0000240000 0000000000000000 0000000000000000 ffffea0000240080
Call Trace:
[<ffffffff810f0200>] ? __lock_page+0xa0/0xb0
[<ffffffff8114bdc5>] deferred_split_scan+0x115/0x240
[<ffffffff8111851c>] ? list_lru_count_one+0x1c/0x30
[<ffffffff811018d3>] shrink_slab.part.42+0x1e3/0x350
[<ffffffff8110644a>] shrink_zone+0x26a/0x280
[<ffffffff8110658d>] do_try_to_free_pages+0x12d/0x3b0
[<ffffffff811068c4>] try_to_free_pages+0xb4/0x140
[<ffffffff810f9279>] __alloc_pages_nodemask+0x459/0x920
[<ffffffff8108d750>] ? trace_event_raw_event_tick_stop+0xd0/0xd0
[<ffffffff81147465>] khugepaged+0x155/0x1b10
[<ffffffff81073ca0>] ? prepare_to_wait_event+0xf0/0xf0
[<ffffffff81147310>] ? __split_huge_pmd_locked+0x4e0/0x4e0
[<ffffffff81057e49>] kthread+0xc9/0xe0
[<ffffffff81057d80>] ? kthread_park+0x60/0x60
[<ffffffff8142aa6f>] ret_from_fork+0x3f/0x70
[<ffffffff81057d80>] ? kthread_park+0x60/0x60
Code: ff ff 48 c7 c6 00 cd 77 81 4c 89 f7 e8 df ce fc ff 0f 0b 48 83 e8 01 e9 94 f7 ff ff 48 c7 c6 80 bb 77 81 4c 89 f7 e8 c5 ce fc ff <0f> 0b 48 c7 c6 48 c9 77 81 4c 89 e7 e8 b4 ce fc ff 0f 0b 66 90
RIP [<ffffffff8114bc9b>] split_huge_page_to_list+0x8fb/0x910
RSP <ffff88007344f968>
---[ end trace 0ee39378e850d8de ]---
Kernel panic - not syncing: Fatal exception
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled

2015-11-17 09:32:22

by Kirill A. Shutemov

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

On Tue, Nov 17, 2015 at 04:35:39PM +0900, Minchan Kim wrote:
> On Mon, Nov 16, 2015 at 12:54:53PM +0200, Kirill A. Shutemov wrote:
> > On Mon, Nov 16, 2015 at 07:32:20PM +0900, Minchan Kim wrote:
> > > On Mon, Nov 16, 2015 at 10:45:22AM +0200, Kirill A. Shutemov wrote:
> > > > On Mon, Nov 16, 2015 at 10:45:21AM +0900, Minchan Kim wrote:
> > > > > While testing MADV_FREE on a kernel with your patches applied,
> > > > > I couldn't see any problem.
> > > > >
> > > > > However, in this round I ran another test. It is the same one
> > > > > I attached, but a little bit different: it doesn't do the
> > > > > memcg/kill/swapoff steps, so the test program runs as a long-lived test.
> > > >
> > > > Could you share updated test?
> > >
> > > It's part of my test suite, so I need to factor it out first.
> > > I will send it when I get to the office tomorrow.
> >
> > Thanks.
> >
> > > > And could you try to reproduce it on clean mmotm-2015-11-10-15-53?
> > >
> > > Before leaving the office I queued it up, and the result is below.
> > > It seems you have already fixed this but the fix is not in mmotm yet. Right?
> > > Anyway, please confirm and tell me which additional patches I should apply
> > > on top of mmotm-2015-11-10-15-53 to pick up your many recent bug
> > > fixes.
> >
> > These two patches of mine are not in the mmotm-2015-11-10-15-53 release:
> >
> > http://lkml.kernel.org/g/1447236557-68682-1-git-send-email-kirill.shutemov@linux.intel.com
> > http://lkml.kernel.org/g/1447236567-68751-1-git-send-email-kirill.shutemov@linux.intel.com
>
> 1. mm: fix __page_mapcount()
> 2. thp: fix leak due split_huge_page() vs. exit race
>
> If I missed any patches, please let me know.
>
> I applied the two patches above on top of mmotm-2015-11-10-15-53 and tested again.
> Unfortunately, the result is below.
>
> I am now preparing a test program I can send you, but it is not easy:
> small changes made while factoring it out of the test suite seem to change
> something (e.g., timing) and make the bug hard to reproduce. I will keep trying.

Your test suite seems to generate quite a few bug reports. Would you mind
making the whole suite public?

> page:ffffea0000240080 count:2 mapcount:1 mapping:ffff88007eff3321 index:0x600000e02
> flags: 0x4000000000040018(uptodate|dirty|swapbacked)
> page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
> page->mem_cgroup:ffff880077cf0c00
> ------------[ cut here ]------------
> kernel BUG at mm/huge_memory.c:3272!
> invalid opcode: 0000 [#1] SMP
> Dumping ftrace buffer:
> (ftrace buffer empty)
> Modules linked in:
> CPU: 8 PID: 59 Comm: khugepaged Not tainted 4.3.0-mm1-kirill+ #8
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> task: ffff880073441a40 ti: ffff88007344c000 task.ti: ffff88007344c000
> RIP: 0010:[<ffffffff8114bc9b>] [<ffffffff8114bc9b>] split_huge_page_to_list+0x8fb/0x910
> RSP: 0018:ffff88007344f968 EFLAGS: 00010286
> RAX: 0000000000000021 RBX: ffffea0000240080 RCX: 0000000000000000
> RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffffff821df4d8
> RBP: ffff88007344f9e8 R08: 0000000000000000 R09: ffff8800000bc600
> R10: ffffffff8163e2c0 R11: 0000000000004b47 R12: ffffea0000240080
> R13: ffffea0000240088 R14: ffffea0000240080 R15: 0000000000000000
> FS: 0000000000000000(0000) GS:ffff880078300000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007ffd59edcd68 CR3: 0000000001808000 CR4: 00000000000006a0
> Stack:
> cccccccccccccccd ffffea0000240080 ffff88007344fa00 ffffea0000240088
> ffff88007344fa00 0000000000000000 ffff88007344f9e8 ffffffff810f0200
> ffffea0000240000 0000000000000000 0000000000000000 ffffea0000240080
> Call Trace:
> [<ffffffff810f0200>] ? __lock_page+0xa0/0xb0
> [<ffffffff8114bdc5>] deferred_split_scan+0x115/0x240
> [<ffffffff8111851c>] ? list_lru_count_one+0x1c/0x30
> [<ffffffff811018d3>] shrink_slab.part.42+0x1e3/0x350
> [<ffffffff8110644a>] shrink_zone+0x26a/0x280
> [<ffffffff8110658d>] do_try_to_free_pages+0x12d/0x3b0
> [<ffffffff811068c4>] try_to_free_pages+0xb4/0x140
> [<ffffffff810f9279>] __alloc_pages_nodemask+0x459/0x920
> [<ffffffff8108d750>] ? trace_event_raw_event_tick_stop+0xd0/0xd0
> [<ffffffff81147465>] khugepaged+0x155/0x1b10
> [<ffffffff81073ca0>] ? prepare_to_wait_event+0xf0/0xf0
> [<ffffffff81147310>] ? __split_huge_pmd_locked+0x4e0/0x4e0
> [<ffffffff81057e49>] kthread+0xc9/0xe0
> [<ffffffff81057d80>] ? kthread_park+0x60/0x60
> [<ffffffff8142aa6f>] ret_from_fork+0x3f/0x70
> [<ffffffff81057d80>] ? kthread_park+0x60/0x60
> Code: ff ff 48 c7 c6 00 cd 77 81 4c 89 f7 e8 df ce fc ff 0f 0b 48 83 e8 01 e9 94 f7 ff ff 48 c7 c6 80 bb 77 81 4c 89 f7 e8 c5 ce fc ff <0f> 0b 48 c7 c6 48 c9 77 81 4c 89 e7 e8 b4 ce fc ff 0f 0b 66 90
> RIP [<ffffffff8114bc9b>] split_huge_page_to_list+0x8fb/0x910
> RSP <ffff88007344f968>
> ---[ end trace 0ee39378e850d8de ]---
> Kernel panic - not syncing: Fatal exception
> Dumping ftrace buffer:
> (ftrace buffer empty)
> Kernel Offset: disabled

I looked into it further. It seems to be a race between split_huge_page() and
deferred_split_scan(), as the dumped page is not huge.

Could you check if the patch below makes any difference to the situation?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 91e2f4b7ca39..923c0f6eb50a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3186,13 +3186,6 @@ static void __split_huge_page(struct page *page, struct list_head *list)
 	spin_lock_irq(&zone->lru_lock);
 	lruvec = mem_cgroup_page_lruvec(head, zone);
 
-	spin_lock(&split_queue_lock);
-	if (!list_empty(page_deferred_list(head))) {
-		split_queue_len--;
-		list_del(page_deferred_list(head));
-	}
-	spin_unlock(&split_queue_lock);
-
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(head);
 
@@ -3299,12 +3292,20 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	freeze_page(anon_vma, head);
 	VM_BUG_ON_PAGE(compound_mapcount(head), head);
 
+	/* Prevent deferred_split_scan() touching ->_count */
+	spin_lock(&split_queue_lock);
 	count = page_count(head);
 	mapcount = total_mapcount(head);
 	if (mapcount == count - 1) {
+		if (!list_empty(page_deferred_list(head))) {
+			split_queue_len--;
+			list_del(page_deferred_list(head));
+		}
+		spin_unlock(&split_queue_lock);
 		__split_huge_page(page, list);
 		ret = 0;
 	} else if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount > count - 1) {
+		spin_unlock(&split_queue_lock);
 		pr_alert("total_mapcount: %u, page_count(): %u\n",
 				mapcount, count);
 		if (PageTail(page))
@@ -3312,6 +3313,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		dump_page(page, "total_mapcount(head) > page_count(head) - 1");
 		BUG();
 	} else {
+		spin_unlock(&split_queue_lock);
 		unfreeze_page(anon_vma, head);
 		ret = -EBUSY;
 	}
--
Kirill A. Shutemov
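
The locking change in the patch above can be modelled in userspace. What follows is a minimal sketch, not kernel code: struct model_page, model_queue(), model_scan_get() and model_try_split() are hypothetical names, a pthread mutex stands in for split_queue_lock, and the scanner is assumed to take its page reference while holding that lock. It illustrates why doing the refcount check and the dequeue under one lock leaves no window in which the scanner can pin a page that is about to be split.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a compound page; not the kernel's struct page. */
struct model_page {
	int count;     /* models page_count(head) */
	int mapcount;  /* models total_mapcount(head) */
	bool queued;   /* models !list_empty(page_deferred_list(head)) */
};

/* A pthread mutex stands in for the kernel's split_queue_lock spinlock. */
static pthread_mutex_t split_queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int split_queue_len;

/* Put a page on the (modelled) deferred split queue. */
static void model_queue(struct model_page *p)
{
	pthread_mutex_lock(&split_queue_lock);
	if (!p->queued) {
		p->queued = true;
		split_queue_len++;
	}
	pthread_mutex_unlock(&split_queue_lock);
}

/* Scanner side: pin a page only if it is still queued, under the lock. */
static bool model_scan_get(struct model_page *p)
{
	bool got = false;

	pthread_mutex_lock(&split_queue_lock);
	if (p->queued) {
		p->count++;  /* extra reference the splitter must observe */
		got = true;
	}
	pthread_mutex_unlock(&split_queue_lock);
	return got;
}

/*
 * Splitter side: the refcount check and the dequeue happen under the
 * same lock, so the scanner either pinned the page before the check
 * (split refused) or finds it already dequeued (no pin taken).
 * Returns 0 if the split may proceed, -1 if the page is busy.
 */
static int model_try_split(struct model_page *p)
{
	int ret = -1;

	pthread_mutex_lock(&split_queue_lock);
	if (p->mapcount == p->count - 1) {
		if (p->queued) {
			p->queued = false;
			split_queue_len--;
		}
		ret = 0;
	}
	pthread_mutex_unlock(&split_queue_lock);
	return ret;
}
```

In this model, a page with count 2 and mapcount 1 can be split and disappears from the queue, while a page the scanner pinned first makes model_try_split() return -1 and stay queued.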

2015-11-19 02:12:21

by Minchan Kim

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

On Tue, Nov 17, 2015 at 11:32:13AM +0200, Kirill A. Shutemov wrote:
> On Tue, Nov 17, 2015 at 04:35:39PM +0900, Minchan Kim wrote:
> > On Mon, Nov 16, 2015 at 12:54:53PM +0200, Kirill A. Shutemov wrote:
> > > On Mon, Nov 16, 2015 at 07:32:20PM +0900, Minchan Kim wrote:
> > > > On Mon, Nov 16, 2015 at 10:45:22AM +0200, Kirill A. Shutemov wrote:
> > > > > On Mon, Nov 16, 2015 at 10:45:21AM +0900, Minchan Kim wrote:
> > > > > > During the MADV_FREE test on a kernel with your patches applied,
> > > > > > I couldn't see any problem.
> > > > > >
> > > > > > However, in this round I ran another test, which is the same as the one
> > > > > > I attached but a little different: it doesn't do the
> > > > > > (memcg things/kill/swapoff) steps, so the test program runs as a long-lived test.
> > > > >
> > > > > Could you share the updated test?
> > > >
> > > > It's part of my testing suite, so I need to factor it out.
> > > > I will send it when I go to the office tomorrow.
> > >
> > > Thanks.
> > >
> > > > > And could you try to reproduce it on clean mmotm-2015-11-10-15-53?
> > > >
> > > > Before leaving the office, I queued it up and the result is below.
> > > > It seems you have already fixed it but haven't applied the fix to mmotm yet, right?
> > > > Anyway, please confirm and tell me which patches I should add
> > > > on top of mmotm-2015-11-10-15-53 to pick up your many recent bug
> > > > fix patches.
> > >
> > > The two patches of mine that are not in the mmotm-2015-11-10-15-53 release:
> > >
> > > http://lkml.kernel.org/g/1447236557-68682-1-git-send-email-kirill.shutemov@linux.intel.com
> > > http://lkml.kernel.org/g/1447236567-68751-1-git-send-email-kirill.shutemov@linux.intel.com
> >
> > 1. mm: fix __page_mapcount()
> > 2. thp: fix leak due split_huge_page() vs. exit race
> >
> > If I missed any patches, let me know.
> >
> > I applied the above two patches on top of mmotm-2015-11-10-15-53 and tested again.
> > But unfortunately, the result was below.
> >
> > Now I am making a test program I can send to you, but it is not easy
> > because small changes made while factoring it out of the testing suite seem to change
> > something (e.g., timing) and make the bug hard to reproduce. I will try again.
>
> Your test suite seems to generate quite a few bug reports. Would you mind making
> the whole suite public?

It's tough because it includes company-internal stuff.
That's why I am trying to factor out the part I can share, but unfortunately
I haven't found time to retry until now. :(

>
> > page:ffffea0000240080 count:2 mapcount:1 mapping:ffff88007eff3321 index:0x600000e02
> > flags: 0x4000000000040018(uptodate|dirty|swapbacked)
> > page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
> > page->mem_cgroup:ffff880077cf0c00
> > ------------[ cut here ]------------
> > kernel BUG at mm/huge_memory.c:3272!
> > invalid opcode: 0000 [#1] SMP
> > Dumping ftrace buffer:
> > (ftrace buffer empty)
> > Modules linked in:
> > CPU: 8 PID: 59 Comm: khugepaged Not tainted 4.3.0-mm1-kirill+ #8
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> > task: ffff880073441a40 ti: ffff88007344c000 task.ti: ffff88007344c000
> > RIP: 0010:[<ffffffff8114bc9b>] [<ffffffff8114bc9b>] split_huge_page_to_list+0x8fb/0x910
> > RSP: 0018:ffff88007344f968 EFLAGS: 00010286
> > RAX: 0000000000000021 RBX: ffffea0000240080 RCX: 0000000000000000
> > RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffffff821df4d8
> > RBP: ffff88007344f9e8 R08: 0000000000000000 R09: ffff8800000bc600
> > R10: ffffffff8163e2c0 R11: 0000000000004b47 R12: ffffea0000240080
> > R13: ffffea0000240088 R14: ffffea0000240080 R15: 0000000000000000
> > FS: 0000000000000000(0000) GS:ffff880078300000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 00007ffd59edcd68 CR3: 0000000001808000 CR4: 00000000000006a0
> > Stack:
> > cccccccccccccccd ffffea0000240080 ffff88007344fa00 ffffea0000240088
> > ffff88007344fa00 0000000000000000 ffff88007344f9e8 ffffffff810f0200
> > ffffea0000240000 0000000000000000 0000000000000000 ffffea0000240080
> > Call Trace:
> > [<ffffffff810f0200>] ? __lock_page+0xa0/0xb0
> > [<ffffffff8114bdc5>] deferred_split_scan+0x115/0x240
> > [<ffffffff8111851c>] ? list_lru_count_one+0x1c/0x30
> > [<ffffffff811018d3>] shrink_slab.part.42+0x1e3/0x350
> > [<ffffffff8110644a>] shrink_zone+0x26a/0x280
> > [<ffffffff8110658d>] do_try_to_free_pages+0x12d/0x3b0
> > [<ffffffff811068c4>] try_to_free_pages+0xb4/0x140
> > [<ffffffff810f9279>] __alloc_pages_nodemask+0x459/0x920
> > [<ffffffff8108d750>] ? trace_event_raw_event_tick_stop+0xd0/0xd0
> > [<ffffffff81147465>] khugepaged+0x155/0x1b10
> > [<ffffffff81073ca0>] ? prepare_to_wait_event+0xf0/0xf0
> > [<ffffffff81147310>] ? __split_huge_pmd_locked+0x4e0/0x4e0
> > [<ffffffff81057e49>] kthread+0xc9/0xe0
> > [<ffffffff81057d80>] ? kthread_park+0x60/0x60
> > [<ffffffff8142aa6f>] ret_from_fork+0x3f/0x70
> > [<ffffffff81057d80>] ? kthread_park+0x60/0x60
> > Code: ff ff 48 c7 c6 00 cd 77 81 4c 89 f7 e8 df ce fc ff 0f 0b 48 83 e8 01 e9 94 f7 ff ff 48 c7 c6 80 bb 77 81 4c 89 f7 e8 c5 ce fc ff <0f> 0b 48 c7 c6 48 c9 77 81 4c 89 e7 e8 b4 ce fc ff 0f 0b 66 90
> > RIP [<ffffffff8114bc9b>] split_huge_page_to_list+0x8fb/0x910
> > RSP <ffff88007344f968>
> > ---[ end trace 0ee39378e850d8de ]---
> > Kernel panic - not syncing: Fatal exception
> > Dumping ftrace buffer:
> > (ftrace buffer empty)
> > Kernel Offset: disabled
>
> I looked into it more. It seems to be a race between split_huge_page() and
> deferred_split_scan(), as the dumped page is not huge.
>
> Could you check if the patch below makes any difference to the situation?
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 91e2f4b7ca39..923c0f6eb50a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3186,13 +3186,6 @@ static void __split_huge_page(struct page *page, struct list_head *list)
>  	spin_lock_irq(&zone->lru_lock);
>  	lruvec = mem_cgroup_page_lruvec(head, zone);
>
> -	spin_lock(&split_queue_lock);
> -	if (!list_empty(page_deferred_list(head))) {
> -		split_queue_len--;
> -		list_del(page_deferred_list(head));
> -	}
> -	spin_unlock(&split_queue_lock);
> -
>  	/* complete memcg works before add pages to LRU */
>  	mem_cgroup_split_huge_fixup(head);
>
> @@ -3299,12 +3292,20 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  	freeze_page(anon_vma, head);
>  	VM_BUG_ON_PAGE(compound_mapcount(head), head);
>
> +	/* Prevent deferred_split_scan() touching ->_count */
> +	spin_lock(&split_queue_lock);
>  	count = page_count(head);
>  	mapcount = total_mapcount(head);
>  	if (mapcount == count - 1) {
> +		if (!list_empty(page_deferred_list(head))) {
> +			split_queue_len--;
> +			list_del(page_deferred_list(head));
> +		}
> +		spin_unlock(&split_queue_lock);
>  		__split_huge_page(page, list);
>  		ret = 0;
>  	} else if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount > count - 1) {
> +		spin_unlock(&split_queue_lock);
>  		pr_alert("total_mapcount: %u, page_count(): %u\n",
>  				mapcount, count);
>  		if (PageTail(page))
> @@ -3312,6 +3313,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  		dump_page(page, "total_mapcount(head) > page_count(head) - 1");
>  		BUG();
>  	} else {
> +		spin_unlock(&split_queue_lock);
>  		unfreeze_page(anon_vma, head);
>  		ret = -EBUSY;
>  	}
> --
> Kirill A. Shutemov
>

It seems to solve that BUG_ON. One guest without the above fix hit
the BUG_ON within 10 hours. Another machine with the fix has run for
more than a day without the BUG_ON, but it introduces a new problem:

BUG: Bad rss-counter state mm:ffff88007f411c00 idx:0 val:-1
BUG: Bad rss-counter state mm:ffff88007f411c00 idx:1 val:1

2015-11-19 06:58:33

by Kirill A. Shutemov

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

On Thu, Nov 19, 2015 at 11:12:21AM +0900, Minchan Kim wrote:
> It seems to solve that BUG_ON. One guest without the above fix hit
> the BUG_ON within 10 hours. Another machine with the fix has run for
> more than a day without the BUG_ON, but it introduces a new problem:
>
> BUG: Bad rss-counter state mm:ffff88007f411c00 idx:0 val:-1
> BUG: Bad rss-counter state mm:ffff88007f411c00 idx:1 val:1

That's rather strange: it looks like one file page was charged as anon, or
one anon page was uncharged as file. I'm not sure yet how this could be caused
by my THP patchset :/

--
Kirill A. Shutemov
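
The reading of the two rss-counter lines above can be checked with a small model. This is a minimal sketch under stated assumptions, not kernel code: it assumes that counter index 0 is the file-page counter and index 1 the anon-page counter (as the reply above implies), and struct model_mm, model_charge(), model_uncharge() and model_check_mm() are hypothetical names. One page charged as anon and later uncharged as file leaves exactly the reported state, idx:0 val:-1 and idx:1 val:1.

```c
#include <stdio.h>

/* Assumed counter layout: index 0 = file pages, index 1 = anon pages. */
enum { MODEL_FILEPAGES, MODEL_ANONPAGES, MODEL_NR_COUNTERS };

/* Hypothetical stand-in for the per-mm RSS counters. */
struct model_mm {
	long rss[MODEL_NR_COUNTERS];
};

/* Account one page to the given counter. */
static void model_charge(struct model_mm *mm, int idx)
{
	mm->rss[idx]++;
}

/* Remove one page from the given counter. */
static void model_uncharge(struct model_mm *mm, int idx)
{
	mm->rss[idx]--;
}

/*
 * Mimics an exit-time sanity check: every counter should be back at
 * zero once all pages are unmapped. Prints each non-zero counter and
 * returns how many were found.
 */
static int model_check_mm(const struct model_mm *mm)
{
	int bad = 0;

	for (int i = 0; i < MODEL_NR_COUNTERS; i++) {
		if (mm->rss[i] != 0) {
			printf("Bad rss-counter state idx:%d val:%ld\n",
			       i, mm->rss[i]);
			bad++;
		}
	}
	return bad;
}
```

Charging with MODEL_ANONPAGES and uncharging with MODEL_FILEPAGES on a zeroed struct model_mm makes model_check_mm() report both counters. The sum of the two counters is still zero, which is why the mismatch points at a page being accounted under the wrong type rather than at a leaked page.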

2015-11-19 10:10:48

by yalin wang

Subject: Re: kernel oops on mmotm-2015-10-15-15-20


> On Nov 19, 2015, at 14:58, Kirill A. Shutemov wrote:
>
> uncharged

I also encountered this crash.

I also hit a crash like this in QEMU:


[ 2.703436] [<ffffffc0001d4d2c>] do_execveat_common.isra.36+0x4f0/0x630
[ 2.703624] [<ffffffc0001d4e90>] do_execve+0x24/0x30
[ 2.703767] [<ffffffc0001d50e0>] SyS_execve+0x1c/0x2c
[ 2.703923] BUG: Bad page map in process init pte:6000004837ebd3 pmd:b29e7003
[ 2.704140] page:ffffffc07f00af80 count:2 mapcount:-1 mapping: (null) index:0x1
[ 2.704414] flags: 0x400000000014(referenced|dirty)
[ 2.704563] page dumped because: bad pte
[ 2.704666] addr:0000007fafb7e000 vm_flags:00100073 anon_vma:ffffffc0729bdb90 mapping: (null) index:7fafb7e
[ 2.704906] file: (null) fault: (null) mmap: (null) readpage: (null)
[ 2.705117] CPU: 0 PID: 84 Comm: init Tainted: G B 4.2.0ajb-00005-g11a9bf3 #80
[ 2.705315] Hardware name: ranchu (DT)
[ 2.705408] Call trace:
[ 2.705488] [<ffffffc000089ea0>] dump_backtrace+0x0/0x124
[ 2.705657] [<ffffffc000089fd4>] show_stack+0x10/0x1c
[ 2.705797] [<ffffffc0005f1df0>] dump_stack+0x78/0x98
[ 2.705971] [<ffffffc00018a8d4>] print_bad_pte+0x154/0x1f0
[ 2.706102] [<ffffffc00018c5f4>] unmap_single_vma+0x574/0x704
[ 2.706236] [<ffffffc00018d0a4>] unmap_vmas+0x54/0x70
[ 2.706354] [<ffffffc000195e70>] exit_mmap+0x88/0xfc
[ 2.706473] [<ffffffc000097af4>] mmput+0x48/0xe8
[ 2.706584] [<ffffffc0001d3b64>] flush_old_exec+0x30c/0x79c
[ 2.706719] [<ffffffc000225fa4>] load_elf_binary+0x21c/0x1098
[ 2.706856] [<ffffffc0001d4330>] search_binary_handler+0xa8/0x224
[ 2.706995] [<ffffffc0001d4d2c>] do_execveat_common.isra.36+0x4f0/0x630
[ 2.707144] [<ffffffc0001d4e90>] do_execve+0x24/0x30
[ 2.707263] [<ffffffc0001d50e0>] SyS_execve+0x1c/0x2c
[ 2.707392] BUG: Bad page map in process init pte:6000004837fbd3 pmd:b29e7003
[ 2.707752] page:ffffffc07f00afc0 count:2 mapcount:-1 mapping: (null) index:0x1
[ 2.708167] flags: 0x400000000014(referenced|dirty)
[ 2.708333] page dumped because: bad pte
[ 2.708501] addr:0000007fafb7f000 vm_flags:00100073 anon_vma:ffffffc0729bdb90 mapping: (null) index:7fafb7f
[ 2.709084] file: (null) fault: (null) mmap: (null) readpage: (null)
[ 2.709306] CPU: 0 PID: 84 Comm: init Tainted: G B 4.2.0ajb-00005-g11a9bf3 #80
[ 2.709494] Hardware name: ranchu (DT)

It seems the page mapcount is not correct.
My build is based on mmotm-2015-10-21-14-41.

Thanks
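
The "mapcount:-1" in the dumps above can be read with a small model of how the per-page map counter is kept. This is a minimal sketch, not kernel code: struct model_page, model_add_rmap() and model_remove_rmap() are hypothetical names, and it assumes the usual convention that the raw counter starts at -1 for an unmapped page, so the reported mapcount is the raw value plus one. Under that assumption a reported -1 means one more unmap than map was accounted for the page.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a page; raw_mapcount is -1 when unmapped. */
struct model_page {
	int raw_mapcount;
};

/* A fresh page has no mappings. */
static void model_page_init(struct model_page *p)
{
	p->raw_mapcount = -1;
}

/* The value a page dump would report as "mapcount". */
static int model_page_mapcount(const struct model_page *p)
{
	return p->raw_mapcount + 1;
}

/* Account one new mapping of the page. */
static void model_add_rmap(struct model_page *p)
{
	p->raw_mapcount++;
}

/*
 * Account the removal of one mapping. Returns true when the removal
 * drove the reported mapcount below zero, which is the condition this
 * sketch treats as a "bad page map".
 */
static bool model_remove_rmap(struct model_page *p)
{
	p->raw_mapcount--;
	return model_page_mapcount(p) < 0;
}
```

Mapping a page once and unmapping it once returns the model to mapcount 0. A second unmap makes model_remove_rmap() return true with a reported mapcount of -1, matching the value in the dumps.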


2015-11-25 07:21:28

by Minchan Kim

Subject: Re: kernel oops on mmotm-2015-10-15-15-20

On Thu, Nov 19, 2015 at 08:58:27AM +0200, Kirill A. Shutemov wrote:
> On Thu, Nov 19, 2015 at 11:12:21AM +0900, Minchan Kim wrote:
> > It seems to solve that BUG_ON. One guest without the above fix hit
> > the BUG_ON within 10 hours. Another machine with the fix has run for
> > more than a day without the BUG_ON, but it introduces a new problem:
> >
> > BUG: Bad rss-counter state mm:ffff88007f411c00 idx:0 val:-1
> > BUG: Bad rss-counter state mm:ffff88007f411c00 idx:1 val:1
>
> That's rather strange: it looks like one file page was charged as anon, or
> one anon page was uncharged as file. I'm not sure yet how this could be caused
> by my THP patchset :/

I couldn't reproduce this problem in another test run over a week, and the
test has shown no problems so far.

Thanks.