LinuxLists.cc - [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd

[permalink] [raw]

Subject: Re: [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> The commit "arm64: switch to generic version of pte allocation"
> introduced endless failures during boot like,
>
> kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> -2 parent: cgroup)
>
> It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> and then later memcg finds out those don't belong to any cgroup.

Mike, I understood from [1] that this wasn't expected to be a problem,
as the accounting should bypass kernel threads.

Was that assumption wrong, or is something different happening here?

>
> backtrace:
> kobject_add_internal
> kobject_init_and_add
> sysfs_slab_add+0x1a8
> __kmem_cache_create
> create_cache
> memcg_create_kmem_cache
> memcg_kmem_cache_create_func
> process_one_work
> worker_thread
> kthread
>
> Signed-off-by: Qian Cai <[email protected]>
> ---
> arch/arm64/mm/pgd.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> index 769516cb6677..53c48f5c8765 100644
> --- a/arch/arm64/mm/pgd.c
> +++ b/arch/arm64/mm/pgd.c
> @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> if (PGD_SIZE == PAGE_SIZE)
> return (pgd_t *)__get_free_page(gfp);
> else
> - return kmem_cache_alloc(pgd_cache, gfp);
> + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);

This is used to allocate PGDs for both user and kernel pagetables (e.g.
for the efi runtime services), so while this may fix the regression, I'm
not sure it's the right fix.

Do we need a separate pgd_alloc_kernel()?

Thanks,
Mark.

[1] https://lkml.kernel.org/r/20190505061956.GE15755@rapoport-lnx

2019-06-04 14:33:26

[permalink] [raw]

Subject: Re: [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > The commit "arm64: switch to generic version of pte allocation"
> > introduced endless failures during boot like,
> >
> > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > -2 parent: cgroup)
> >
> > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > and then later memcg finds out those don't belong to any cgroup.
>
> Mike, I understood from [1] that this wasn't expected to be a problem,
> as the accounting should bypass kernel threads.
>
> Was that assumption wrong, or is something different happening here?
>
> > backtrace:
> > kobject_add_internal
> > kobject_init_and_add
> > sysfs_slab_add+0x1a8
> > __kmem_cache_create
> > create_cache
> > memcg_create_kmem_cache
> > memcg_kmem_cache_create_func
> > process_one_work
> > worker_thread
> > kthread
> >
> > Signed-off-by: Qian Cai <[email protected]>
> > ---
> > arch/arm64/mm/pgd.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > index 769516cb6677..53c48f5c8765 100644
> > --- a/arch/arm64/mm/pgd.c
> > +++ b/arch/arm64/mm/pgd.c
> > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > if (PGD_SIZE == PAGE_SIZE)
> > return (pgd_t *)__get_free_page(gfp);
> > else
> > - return kmem_cache_alloc(pgd_cache, gfp);
> > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
>
> This is used to allocate PGDs for both user and kernel pagetables (e.g.
> for the efi runtime services), so while this may fix the regression, I'm
> not sure it's the right fix.

I see that since [1], pgd_alloc() was updated to special-case the
init_mm, which is not sufficient for cases like:

efi_mm.pgd = pgd_alloc(&efi_mm)

... which occurs in a kthread.

So let's have a pgd_alloc_kernel() to make that explicit.

Thanks,
Mark.

2019-06-04 14:56:24

[permalink] [raw]

Subject: Re: [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > The commit "arm64: switch to generic version of pte allocation"
> > introduced endless failures during boot like,
> >
> > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > -2 parent: cgroup)
> >
> > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > and then later memcg finds out those don't belong to any cgroup.
>
> Mike, I understood from [1] that this wasn't expected to be a problem,
> as the accounting should bypass kernel threads.
>
> Was that assumption wrong, or is something different happening here?

I was under impression that all allocations are going through
__memcg_kmem_charge() which does the bypass.

Apparently, it's not the case :(

> >
> > backtrace:
> > kobject_add_internal
> > kobject_init_and_add
> > sysfs_slab_add+0x1a8
> > __kmem_cache_create
> > create_cache
> > memcg_create_kmem_cache
> > memcg_kmem_cache_create_func
> > process_one_work
> > worker_thread
> > kthread
> >
> > Signed-off-by: Qian Cai <[email protected]>
> > ---
> > arch/arm64/mm/pgd.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > index 769516cb6677..53c48f5c8765 100644
> > --- a/arch/arm64/mm/pgd.c
> > +++ b/arch/arm64/mm/pgd.c
> > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > if (PGD_SIZE == PAGE_SIZE)
> > return (pgd_t *)__get_free_page(gfp);
> > else
> > - return kmem_cache_alloc(pgd_cache, gfp);
> > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
>
> This is used to allocate PGDs for both user and kernel pagetables (e.g.
> for the efi runtime services), so while this may fix the regression, I'm
> not sure it's the right fix.

Me neither.

> Do we need a separate pgd_alloc_kernel()?

I'd like to take a closer look at memcg paths once again before adding
pgd_alloc_kernel().

Johannes, Roman, can you please advise anything?

> Thanks,
> Mark.
>
> [1] https://lkml.kernel.org/r/20190505061956.GE15755@rapoport-lnx
>

--
Sincerely yours,
Mike.

2019-06-05 21:35:23

[permalink] [raw]

Subject: Re: [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

On Tue, Jun 04, 2019 at 03:30:20PM +0100, Mark Rutland wrote:
> On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > > The commit "arm64: switch to generic version of pte allocation"
> > > introduced endless failures during boot like,
> > >
> > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > > -2 parent: cgroup)
> > >
> > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > > and then later memcg finds out those don't belong to any cgroup.
> >
> > Mike, I understood from [1] that this wasn't expected to be a problem,
> > as the accounting should bypass kernel threads.
> >
> > Was that assumption wrong, or is something different happening here?
> >
> > > backtrace:
> > > kobject_add_internal
> > > kobject_init_and_add
> > > sysfs_slab_add+0x1a8
> > > __kmem_cache_create
> > > create_cache
> > > memcg_create_kmem_cache
> > > memcg_kmem_cache_create_func
> > > process_one_work
> > > worker_thread
> > > kthread
> > >
> > > Signed-off-by: Qian Cai <[email protected]>
> > > ---
> > > arch/arm64/mm/pgd.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > > index 769516cb6677..53c48f5c8765 100644
> > > --- a/arch/arm64/mm/pgd.c
> > > +++ b/arch/arm64/mm/pgd.c
> > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > > if (PGD_SIZE == PAGE_SIZE)
> > > return (pgd_t *)__get_free_page(gfp);
> > > else
> > > - return kmem_cache_alloc(pgd_cache, gfp);
> > > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
> >
> > This is used to allocate PGDs for both user and kernel pagetables (e.g.
> > for the efi runtime services), so while this may fix the regression, I'm
> > not sure it's the right fix.
>
> I see that since [1], pgd_alloc() was updated to special-case the
> init_mm, which is not sufficient for cases like:
>
> efi_mm.pgd = pgd_alloc(&efi_mm)
>
> ... which occurs in a kthread.
>
> So let's have a pgd_alloc_kernel() to make that explicit.

I've hit "send" before seeing this one :)

Well, to be completely on the safe side an explicit pgd_alloc_kernel()
sounds right. Then it won't be subject to future changes in memcg and will
always "Do The Right Thing".

> Thanks,
> Mark.
>

--
Sincerely yours,
Mike.

2019-06-10 12:16:10

by Will Deacon

[permalink] [raw]

Subject: Re: [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > The commit "arm64: switch to generic version of pte allocation"
> > introduced endless failures during boot like,
> >
> > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > -2 parent: cgroup)
> >
> > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > and then later memcg finds out those don't belong to any cgroup.
>
> Mike, I understood from [1] that this wasn't expected to be a problem,
> as the accounting should bypass kernel threads.
>
> Was that assumption wrong, or is something different happening here?
>
> >
> > backtrace:
> > kobject_add_internal
> > kobject_init_and_add
> > sysfs_slab_add+0x1a8
> > __kmem_cache_create
> > create_cache
> > memcg_create_kmem_cache
> > memcg_kmem_cache_create_func
> > process_one_work
> > worker_thread
> > kthread
> >
> > Signed-off-by: Qian Cai <[email protected]>
> > ---
> > arch/arm64/mm/pgd.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > index 769516cb6677..53c48f5c8765 100644
> > --- a/arch/arm64/mm/pgd.c
> > +++ b/arch/arm64/mm/pgd.c
> > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > if (PGD_SIZE == PAGE_SIZE)
> > return (pgd_t *)__get_free_page(gfp);
> > else
> > - return kmem_cache_alloc(pgd_cache, gfp);
> > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
>
> This is used to allocate PGDs for both user and kernel pagetables (e.g.
> for the efi runtime services), so while this may fix the regression, I'm
> not sure it's the right fix.
>
> Do we need a separate pgd_alloc_kernel()?

So can I take the above for -rc5, or is somebody else working on a different
fix to implement pgd_alloc_kernel()?

/confused

Will

2019-06-10 17:28:23

[permalink] [raw]

Subject: Re: [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote:
> On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > > The commit "arm64: switch to generic version of pte allocation"
> > > introduced endless failures during boot like,
> > >
> > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > > -2 parent: cgroup)
> > >
> > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > > and then later memcg finds out those don't belong to any cgroup.
> >
> > Mike, I understood from [1] that this wasn't expected to be a problem,
> > as the accounting should bypass kernel threads.
> >
> > Was that assumption wrong, or is something different happening here?
> >
> > >
> > > backtrace:
> > > kobject_add_internal
> > > kobject_init_and_add
> > > sysfs_slab_add+0x1a8
> > > __kmem_cache_create
> > > create_cache
> > > memcg_create_kmem_cache
> > > memcg_kmem_cache_create_func
> > > process_one_work
> > > worker_thread
> > > kthread
> > >
> > > Signed-off-by: Qian Cai <[email protected]>
> > > ---
> > > arch/arm64/mm/pgd.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > > index 769516cb6677..53c48f5c8765 100644
> > > --- a/arch/arm64/mm/pgd.c
> > > +++ b/arch/arm64/mm/pgd.c
> > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > > if (PGD_SIZE == PAGE_SIZE)
> > > return (pgd_t *)__get_free_page(gfp);
> > > else
> > > - return kmem_cache_alloc(pgd_cache, gfp);
> > > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
> >
> > This is used to allocate PGDs for both user and kernel pagetables (e.g.
> > for the efi runtime services), so while this may fix the regression, I'm
> > not sure it's the right fix.
> >
> > Do we need a separate pgd_alloc_kernel()?
>
> So can I take the above for -rc5, or is somebody else working on a different
> fix to implement pgd_alloc_kernel()?

The offensive commit "arm64: switch to generic version of pte allocation" is not
yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt
Andrew will push this out any time sooner given it is broken.

2019-06-11 10:05:07

[permalink] [raw]

Subject: Re: [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote:
> On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote:
> > On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> > > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > > > The commit "arm64: switch to generic version of pte allocation"
> > > > introduced endless failures during boot like,
> > > >
> > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > > > -2 parent: cgroup)
> > > >
> > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > > > and then later memcg finds out those don't belong to any cgroup.
> > >
> > > Mike, I understood from [1] that this wasn't expected to be a problem,
> > > as the accounting should bypass kernel threads.
> > >
> > > Was that assumption wrong, or is something different happening here?
> > >
> > > >
> > > > backtrace:
> > > > kobject_add_internal
> > > > kobject_init_and_add
> > > > sysfs_slab_add+0x1a8
> > > > __kmem_cache_create
> > > > create_cache
> > > > memcg_create_kmem_cache
> > > > memcg_kmem_cache_create_func
> > > > process_one_work
> > > > worker_thread
> > > > kthread
> > > >
> > > > Signed-off-by: Qian Cai <[email protected]>
> > > > ---
> > > > arch/arm64/mm/pgd.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > > > index 769516cb6677..53c48f5c8765 100644
> > > > --- a/arch/arm64/mm/pgd.c
> > > > +++ b/arch/arm64/mm/pgd.c
> > > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > > > if (PGD_SIZE == PAGE_SIZE)
> > > > return (pgd_t *)__get_free_page(gfp);
> > > > else
> > > > - return kmem_cache_alloc(pgd_cache, gfp);
> > > > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
> > >
> > > This is used to allocate PGDs for both user and kernel pagetables (e.g.
> > > for the efi runtime services), so while this may fix the regression, I'm
> > > not sure it's the right fix.
> > >
> > > Do we need a separate pgd_alloc_kernel()?
> >
> > So can I take the above for -rc5, or is somebody else working on a different
> > fix to implement pgd_alloc_kernel()?
>
> The offensive commit "arm64: switch to generic version of pte allocation" is not
> yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt
> Andrew will push this out any time sooner given it is broken.

I'd assumed that Mike would respin these patches to implement and use
pgd_alloc_kernel() (or take gfp flags) and the updated patches would
replace these in akpm's tree.

Mike, could you confirm what your plan is? I'm happy to review/test
updated patches for arm64.

Thanks,
Mark.

2019-06-11 12:42:48

[permalink] [raw]

Subject: Re: [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

On Tue, Jun 11, 2019 at 11:03:49AM +0100, Mark Rutland wrote:
> On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote:
> > On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote:
> > > On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> > > > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > > > > The commit "arm64: switch to generic version of pte allocation"
> > > > > introduced endless failures during boot like,
> > > > >
> > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > > > > -2 parent: cgroup)
> > > > >
> > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > > > > and then later memcg finds out those don't belong to any cgroup.
> > > >
> > > > Mike, I understood from [1] that this wasn't expected to be a problem,
> > > > as the accounting should bypass kernel threads.
> > > >
> > > > Was that assumption wrong, or is something different happening here?
> > > >
> > > > >
> > > > > backtrace:
> > > > > ? kobject_add_internal
> > > > > ? kobject_init_and_add
> > > > > ? sysfs_slab_add+0x1a8
> > > > > ? __kmem_cache_create
> > > > > ? create_cache
> > > > > ? memcg_create_kmem_cache
> > > > > ? memcg_kmem_cache_create_func
> > > > > ? process_one_work
> > > > > ? worker_thread
> > > > > ? kthread
> > > > >
> > > > > Signed-off-by: Qian Cai <[email protected]>
> > > > > ---
> > > > > ?arch/arm64/mm/pgd.c | 2 +-
> > > > > ?1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > > > > index 769516cb6677..53c48f5c8765 100644
> > > > > --- a/arch/arm64/mm/pgd.c
> > > > > +++ b/arch/arm64/mm/pgd.c
> > > > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > > > > ? if (PGD_SIZE == PAGE_SIZE)
> > > > > ? return (pgd_t *)__get_free_page(gfp);
> > > > > ? else
> > > > > - return kmem_cache_alloc(pgd_cache, gfp);
> > > > > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
> > > >
> > > > This is used to allocate PGDs for both user and kernel pagetables (e.g.
> > > > for the efi runtime services), so while this may fix the regression, I'm
> > > > not sure it's the right fix.
> > > >
> > > > Do we need a separate pgd_alloc_kernel()?
> > >
> > > So can I take the above for -rc5, or is somebody else working on a different
> > > fix to implement pgd_alloc_kernel()?
> >
> > The offensive commit "arm64: switch to generic version of pte allocation" is not
> > yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt
> > Andrew will push this out any time sooner given it is broken.
>
> I'd assumed that Mike would respin these patches to implement and use
> pgd_alloc_kernel() (or take gfp flags) and the updated patches would
> replace these in akpm's tree.
>
> Mike, could you confirm what your plan is? I'm happy to review/test
> updated patches for arm64.

Sorry for the delay, I'm mostly offline these days.

I wanted to understand first what is the reason for the failure. I've tried
to reproduce it with qemu, but I failed to find a bootable configuration
that will have PGD_SIZE != PAGE_SIZE :(

Qian Cai, can you share what is your environment and the kernel config?

> Thanks,
> Mark.
>

--
Sincerely yours,
Mike.

2019-06-11 12:47:26

[permalink] [raw]

Subject: Re: [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

> On Jun 11, 2019, at 8:41 AM, Mike Rapoport <[email protected]> wrote:
>
> On Tue, Jun 11, 2019 at 11:03:49AM +0100, Mark Rutland wrote:
>> On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote:
>>> On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote:
>>>> On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
>>>>> On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
>>>>>> The commit "arm64: switch to generic version of pte allocation"
>>>>>> introduced endless failures during boot like,
>>>>>>
>>>>>> kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
>>>>>> -2 parent: cgroup)
>>>>>>
>>>>>> It turns out __GFP_ACCOUNT is passed to kernel page table allocations
>>>>>> and then later memcg finds out those don't belong to any cgroup.
>>>>>
>>>>> Mike, I understood from [1] that this wasn't expected to be a problem,
>>>>> as the accounting should bypass kernel threads.
>>>>>
>>>>> Was that assumption wrong, or is something different happening here?
>>>>>
>>>>>>
>>>>>> backtrace:
>>>>>> kobject_add_internal
>>>>>> kobject_init_and_add
>>>>>> sysfs_slab_add+0x1a8
>>>>>> __kmem_cache_create
>>>>>> create_cache
>>>>>> memcg_create_kmem_cache
>>>>>> memcg_kmem_cache_create_func
>>>>>> process_one_work
>>>>>> worker_thread
>>>>>> kthread
>>>>>>
>>>>>> Signed-off-by: Qian Cai <[email protected]>
>>>>>> ---
>>>>>> arch/arm64/mm/pgd.c | 2 +-
>>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
>>>>>> index 769516cb6677..53c48f5c8765 100644
>>>>>> --- a/arch/arm64/mm/pgd.c
>>>>>> +++ b/arch/arm64/mm/pgd.c
>>>>>> @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
>>>>>> if (PGD_SIZE == PAGE_SIZE)
>>>>>> return (pgd_t *)__get_free_page(gfp);
>>>>>> else
>>>>>> - return kmem_cache_alloc(pgd_cache, gfp);
>>>>>> + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
>>>>>
>>>>> This is used to allocate PGDs for both user and kernel pagetables (e.g.
>>>>> for the efi runtime services), so while this may fix the regression, I'm
>>>>> not sure it's the right fix.
>>>>>
>>>>> Do we need a separate pgd_alloc_kernel()?
>>>>
>>>> So can I take the above for -rc5, or is somebody else working on a different
>>>> fix to implement pgd_alloc_kernel()?
>>>
>>> The offensive commit "arm64: switch to generic version of pte allocation" is not
>>> yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt
>>> Andrew will push this out any time sooner given it is broken.
>>
>> I'd assumed that Mike would respin these patches to implement and use
>> pgd_alloc_kernel() (or take gfp flags) and the updated patches would
>> replace these in akpm's tree.
>>
>> Mike, could you confirm what your plan is? I'm happy to review/test
>> updated patches for arm64.
>
> Sorry for the delay, I'm mostly offline these days.
>
> I wanted to understand first what is the reason for the failure. I've tried
> to reproduce it with qemu, but I failed to find a bootable configuration
> that will have PGD_SIZE != PAGE_SIZE :(
>
> Qian Cai, can you share what is your environment and the kernel config?

https://raw.githubusercontent.com/cailca/linux-mm/master/arm64.config

# lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 4
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 2
Vendor ID: Cavium
Model: 1
Model name: ThunderX2 99xx
Stepping: 0x1
BogoMIPS: 400.00
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 32768K
NUMA node0 CPU(s): 0-127
NUMA node1 CPU(s): 128-255
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm

# dmidecode
Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: HPE
Product Name: Apollo 70
Version: X1
Wake-up Type: Power Switch
Family: CN99XX

2019-06-11 13:59:22

[permalink] [raw]

Subject: Re: [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

On Tue, Jun 11, 2019 at 03:41:19PM +0300, Mike Rapoport wrote:
> On Tue, Jun 11, 2019 at 11:03:49AM +0100, Mark Rutland wrote:
> > On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote:
> > > On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote:
> > > > On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> > > > > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > > > > > The commit "arm64: switch to generic version of pte allocation"
> > > > > > introduced endless failures during boot like,
> > > > > >
> > > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > > > > > -2 parent: cgroup)
> > > > > >
> > > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > > > > > and then later memcg finds out those don't belong to any cgroup.
> > > > >
> > > > > Mike, I understood from [1] that this wasn't expected to be a problem,
> > > > > as the accounting should bypass kernel threads.
> > > > >
> > > > > Was that assumption wrong, or is something different happening here?
> > > > >
> > > > > >
> > > > > > backtrace:
> > > > > > kobject_add_internal
> > > > > > kobject_init_and_add
> > > > > > sysfs_slab_add+0x1a8
> > > > > > __kmem_cache_create
> > > > > > create_cache
> > > > > > memcg_create_kmem_cache
> > > > > > memcg_kmem_cache_create_func
> > > > > > process_one_work
> > > > > > worker_thread
> > > > > > kthread
> > > > > >
> > > > > > Signed-off-by: Qian Cai <[email protected]>
> > > > > > ---
> > > > > > arch/arm64/mm/pgd.c | 2 +-
> > > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > > > > > index 769516cb6677..53c48f5c8765 100644
> > > > > > --- a/arch/arm64/mm/pgd.c
> > > > > > +++ b/arch/arm64/mm/pgd.c
> > > > > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > > > > > if (PGD_SIZE == PAGE_SIZE)
> > > > > > return (pgd_t *)__get_free_page(gfp);
> > > > > > else
> > > > > > - return kmem_cache_alloc(pgd_cache, gfp);
> > > > > > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
> > > > >
> > > > > This is used to allocate PGDs for both user and kernel pagetables (e.g.
> > > > > for the efi runtime services), so while this may fix the regression, I'm
> > > > > not sure it's the right fix.
> > > > >
> > > > > Do we need a separate pgd_alloc_kernel()?
> > > >
> > > > So can I take the above for -rc5, or is somebody else working on a different
> > > > fix to implement pgd_alloc_kernel()?
> > >
> > > The offensive commit "arm64: switch to generic version of pte allocation" is not
> > > yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt
> > > Andrew will push this out any time sooner given it is broken.
> >
> > I'd assumed that Mike would respin these patches to implement and use
> > pgd_alloc_kernel() (or take gfp flags) and the updated patches would
> > replace these in akpm's tree.
> >
> > Mike, could you confirm what your plan is? I'm happy to review/test
> > updated patches for arm64.
>
> Sorry for the delay, I'm mostly offline these days.
>
> I wanted to understand first what is the reason for the failure. I've tried
> to reproduce it with qemu, but I failed to find a bootable configuration
> that will have PGD_SIZE != PAGE_SIZE :(

This is the case with 48-bit VA and 64K pages. In that case we have
three levels of table, and the PGD is 1/16th of a page, as it only needs
to resolve 9 bits of virtual address rather than 13.

If you build defconfig + ARM64_64K_PAGES=y, that should be the case:

[mark@lakrids:~/src/linux]% usekorg 8.1.0 aarch64-linux-objdump -d arch/arm64/mm/pgd.o

arch/arm64/mm/pgd.o: file format elf64-littleaarch64

Disassembly of section .text:

0000000000000000 <pgd_alloc>:
0: a9bf7bfd stp x29, x30, [sp, #-16]!
4: 90000000 adrp x0, 0 <pgd_alloc>
8: 5281b801 mov w1, #0xdc0 // #3520
c: 910003fd mov x29, sp
10: f9400000 ldr x0, [x0]
14: 94000000 bl 0 <kmem_cache_alloc>
18: a8c17bfd ldp x29, x30, [sp], #16
1c: d65f03c0 ret

0000000000000020 <pgd_free>:
20: a9bf7bfd stp x29, x30, [sp, #-16]!
24: 90000000 adrp x0, 0 <pgd_alloc>
28: 910003fd mov x29, sp
2c: f9400000 ldr x0, [x0]
30: 94000000 bl 0 <kmem_cache_free>
34: a8c17bfd ldp x29, x30, [sp], #16
38: d65f03c0 ret

Disassembly of section .init.text:

0000000000000000 <pgd_cache_init>:
0: a9bf7bfd stp x29, x30, [sp, #-16]!
4: 52804002 mov w2, #0x200 // #512
8: d2800004 mov x4, #0x0 // #0
c: 910003fd mov x29, sp
10: 2a0203e1 mov w1, w2
14: 52a00083 mov w3, #0x40000 // #262144
18: 90000000 adrp x0, 0 <pgd_cache_init>
1c: 91000000 add x0, x0, #0x0
20: 94000000 bl 0 <kmem_cache_create>
24: 90000001 adrp x1, 0 <pgd_cache_init>
28: a8c17bfd ldp x29, x30, [sp], #16
2c: f9000020 str x0, [x1]
30: d65f03c0 ret

Thanks,
Mark.

2019-06-12 08:31:38

[permalink] [raw]

Subject: Re: [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

Hi,

On Tue, Jun 11, 2019 at 08:46:45AM -0400, Qian Cai wrote:
>
> > On Jun 11, 2019, at 8:41 AM, Mike Rapoport <[email protected]> wrote:
> >
> > Sorry for the delay, I'm mostly offline these days.
> >
> > I wanted to understand first what is the reason for the failure. I've tried
> > to reproduce it with qemu, but I failed to find a bootable configuration
> > that will have PGD_SIZE != PAGE_SIZE :(
> >
> > Qian Cai, can you share what is your environment and the kernel config?
>
> https://raw.githubusercontent.com/cailca/linux-mm/master/arm64.config
>
> # lscpu
> Architecture: aarch64
> Byte Order: Little Endian
> CPU(s): 256
> On-line CPU(s) list: 0-255
> Thread(s) per core: 4
> Core(s) per socket: 32
> Socket(s): 2
> NUMA node(s): 2
> Vendor ID: Cavium
> Model: 1
> Model name: ThunderX2 99xx
> Stepping: 0x1
> BogoMIPS: 400.00
> L1d cache: 32K
> L1i cache: 32K
> L2 cache: 256K
> L3 cache: 32768K
> NUMA node0 CPU(s): 0-127
> NUMA node1 CPU(s): 128-255
> Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm
>
> # dmidecode
> Handle 0x0001, DMI type 1, 27 bytes
> System Information
> Manufacturer: HPE
> Product Name: Apollo 70
> Version: X1
> Wake-up Type: Power Switch
> Family: CN99XX
>

Can you please also send the entire log when the failure happens?

Another question, is the problem exist with PGD_SIZE == PAGE_SIZE?

--
Sincerely yours,
Mike.

2019-06-12 18:36:26

[permalink] [raw]

Subject: Re: [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

On Wed, 2019-06-12 at 09:57 +0300, Mike Rapoport wrote:
> Hi,
>
> On Tue, Jun 11, 2019 at 08:46:45AM -0400, Qian Cai wrote:
> >
> > > On Jun 11, 2019, at 8:41 AM, Mike Rapoport <[email protected]> wrote:
> > >
> > > Sorry for the delay, I'm mostly offline these days.
> > >
> > > I wanted to understand first what is the reason for the failure. I've
> > > tried
> > > to reproduce it with qemu, but I failed to find a bootable configuration
> > > that will have PGD_SIZE != PAGE_SIZE :(
> > >
> > > Qian Cai, can you share what is your environment and the kernel config?
> >
> > https://raw.githubusercontent.com/cailca/linux-mm/master/arm64.config
> >
> > # lscpu
> > Architecture:        aarch64
> > Byte Order:          Little Endian
> > CPU(s):              256
> > On-line CPU(s) list: 0-255
> > Thread(s) per core:  4
> > Core(s) per socket:  32
> > Socket(s):           2
> > NUMA node(s):        2
> > Vendor ID:           Cavium
> > Model:               1
> > Model name:          ThunderX2 99xx
> > Stepping:            0x1
> > BogoMIPS:            400.00
> > L1d cache:           32K
> > L1i cache:           32K
> > L2 cache:            256K
> > L3 cache:            32768K
> > NUMA node0 CPU(s):   0-127
> > NUMA node1 CPU(s):   128-255
> > Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics
> > cpuid asimdrdm
> >
> > # dmidecode
> > Handle 0x0001, DMI type 1, 27 bytes
> > System Information
> >         Manufacturer: HPE
> >         Product Name: Apollo 70
> >         Version: X1
> >         Wake-up Type: Power Switch
> >         Family: CN99XX
> >
>
> Can you please also send the entire log when the failure happens?

https://cailca.github.io/files/dmesg.txt

> Another question, is the problem exist with PGD_SIZE == PAGE_SIZE?

No.

2019-06-13 15:18:35

[permalink] [raw]

Subject: Re: [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

On Thu, 2019-06-13 at 15:11 +0300, Mike Rapoport wrote:
> The log Qian Cai posted at [1] and partially cited below confirms that the
> failure happens when *user* PGDs are allocated and the addition of
> __GFP_ACCOUNT to gfp flags used by pgd_alloc() only uncovered another
> issue.
>
> I'm still failing to reproduce it with qemu and I'm not really familiar
> with slub/memcg code to say anything smart about it. Will keep looking.
>
> Note, that as failures start way after efi_virtmap_init() that allocates a
> PGD for efi_mm, there are no real fixes required for the original series,
> except that the check for mm == &init_mm I copied for some reason from
> powerpc is bogus and can be removed.

Yes, there is more places are not happy with __GFP_ACCOUNT other than efi_mm.
For example,

[  132.786842][ T1501] kobject_add_internal failed for pgd_cache(49:systemd-
udevd.service) (error: -2 parent: cgroup)
[  132.795589][ T1889] CPU: 9 PID: 1889 Comm: systemd-udevd Tainted:
G        W         5.2.0-rc4-next-20190613+ #8
[  132.807356][ T1889] Hardware name: HPE Apollo
70             /C01_APACHE_MB         , BIOS L50_5.13_1.0.9 03/01/2019
[  132.817872][ T1889] Call trace:
[  132.821017][ T1889]  dump_backtrace+0x0/0x268
[  132.825372][ T1889]  show_stack+0x20/0x2c
[  132.829380][ T1889]  dump_stack+0xb4/0x108
[  132.833475][ T1889]  pgd_alloc+0x34/0x5c
[  132.837396][ T1889]  mm_init+0x27c/0x32c
[  132.841315][ T1889]  dup_mm+0x84/0x7b4
[  132.845061][ T1889]  copy_process+0xf20/0x24cc
[  132.849500][ T1889]  _do_fork+0xa4/0x66c
[  132.853420][ T1889]  __arm64_sys_clone+0x114/0x1b4
[  132.858208][ T1889]  el0_svc_handler+0x198/0x260
[  132.862821][ T1889]  el0_svc+0x8/0xc

>
> I surely can add pgd_alloc_kernel() to be used by the EFI code to make sure
> we won't run into issues with memcg in the future.
>
> [   82.125966] Freeing unused kernel memory: 28672K
> [   87.940365] Checked W+X mappings: passed, no W+X pages found
> [   87.946769] Run /init as init process
> [   88.040040] systemd[1]: System time before build time, advancing clock.
> [   88.054593] systemd[1]: Failed to insert module 'autofs4': No such file or
> directory
> [   88.374129] modprobe (1726) used greatest stack depth: 28464 bytes left
> [   88.470108] systemd[1]: systemd 239 running in system mode. (+PAM +AUDIT
> +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT
> +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2
> default-hierarchy=legacy)
> [   88.498398] systemd[1]: Detected architecture arm64.
> [   88.506517] systemd[1]: Running in initial RAM disk.
> [   89.621995] mkdir (1730) used greatest stack depth: 27872 bytes left
> [   90.222658] random: systemd: uninitialized urandom read (16 bytes read)
> [   90.230072] systemd[1]: Reached target Swap.
> [   90.240205] random: systemd: uninitialized urandom read (16 bytes read)
> [   90.251088] systemd[1]: Reached target Timers.
> [   90.261303] random: systemd: uninitialized urandom read (16 bytes read)
> [   90.271209] systemd[1]: Listening on udev Control Socket.
> [   90.283238] systemd[1]: Reached target Local File Systems.
> [   90.296232] systemd[1]: Reached target Slices.
> [   90.307239] systemd[1]: Listening on udev Kernel Socket.
> [   90.608597] kobject_add_internal failed for pgd_cache(13:init.scope)
> (error: -2 parent: cgroup)
> [   90.678007] kobject_add_internal failed for pgd_cache(13:init.scope)(error:
> -2 parent: cgroup)
> [   90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles-
> setup.service) (error: -2 parent: cgroup)
> [   90.820012] systemd-tmpfile (1759) used greatest stack depth: 27184 bytes
> left
> [   90.861942] kobject_add_internal failed for pgd_cache(13:init.scope) error:
> -2 parent: cgroup)
>
> > Thanks,
> > Mark.
> >
>
> [1] https://cailca.github.io/files/dmesg.txt
>

2019-06-13 15:26:40

[permalink] [raw]

Subject: Re: [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

(added Roman)

On Tue, Jun 11, 2019 at 11:03:49AM +0100, Mark Rutland wrote:
> On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote:
> > On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote:
> > > On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> > > > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > > > > The commit "arm64: switch to generic version of pte allocation"
> > > > > introduced endless failures during boot like,
> > > > >
> > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > > > > -2 parent: cgroup)
> > > > >
> > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > > > > and then later memcg finds out those don't belong to any cgroup.
> > > >
> > > > Mike, I understood from [1] that this wasn't expected to be a problem,
> > > > as the accounting should bypass kernel threads.
> > > >
> > > > Was that assumption wrong, or is something different happening here?
> > > >
> > > > >
> > > > > backtrace:
> > > > > ? kobject_add_internal
> > > > > ? kobject_init_and_add
> > > > > ? sysfs_slab_add+0x1a8
> > > > > ? __kmem_cache_create
> > > > > ? create_cache
> > > > > ? memcg_create_kmem_cache
> > > > > ? memcg_kmem_cache_create_func
> > > > > ? process_one_work
> > > > > ? worker_thread
> > > > > ? kthread
> > > > >
> > > > > Signed-off-by: Qian Cai <[email protected]>
> > > > > ---
> > > > > ?arch/arm64/mm/pgd.c | 2 +-
> > > > > ?1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > > > > index 769516cb6677..53c48f5c8765 100644
> > > > > --- a/arch/arm64/mm/pgd.c
> > > > > +++ b/arch/arm64/mm/pgd.c
> > > > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > > > > ? if (PGD_SIZE == PAGE_SIZE)
> > > > > ? return (pgd_t *)__get_free_page(gfp);
> > > > > ? else
> > > > > - return kmem_cache_alloc(pgd_cache, gfp);
> > > > > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
> > > >
> > > > This is used to allocate PGDs for both user and kernel pagetables (e.g.
> > > > for the efi runtime services), so while this may fix the regression, I'm
> > > > not sure it's the right fix.
> > > >
> > > > Do we need a separate pgd_alloc_kernel()?
> > >
> > > So can I take the above for -rc5, or is somebody else working on a different
> > > fix to implement pgd_alloc_kernel()?
> >
> > The offensive commit "arm64: switch to generic version of pte allocation" is not
> > yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt
> > Andrew will push this out any time sooner given it is broken.
>
> I'd assumed that Mike would respin these patches to implement and use
> pgd_alloc_kernel() (or take gfp flags) and the updated patches would
> replace these in akpm's tree.
>
> Mike, could you confirm what your plan is? I'm happy to review/test
> updated patches for arm64.

The log Qian Cai posted at [1] and partially cited below confirms that the
failure happens when *user* PGDs are allocated and the addition of
__GFP_ACCOUNT to gfp flags used by pgd_alloc() only uncovered another
issue.

I'm still failing to reproduce it with qemu and I'm not really familiar
with slub/memcg code to say anything smart about it. Will keep looking.

Note, that as failures start way after efi_virtmap_init() that allocates a
PGD for efi_mm, there are no real fixes required for the original series,
except that the check for mm == &init_mm I copied for some reason from
powerpc is bogus and can be removed.

I surely can add pgd_alloc_kernel() to be used by the EFI code to make sure
we won't run into issues with memcg in the future.

[ 82.125966] Freeing unused kernel memory: 28672K
[ 87.940365] Checked W+X mappings: passed, no W+X pages found
[ 87.946769] Run /init as init process
[ 88.040040] systemd[1]: System time before build time, advancing clock.
[ 88.054593] systemd[1]: Failed to insert module 'autofs4': No such file or directory
[ 88.374129] modprobe (1726) used greatest stack depth: 28464 bytes left
[ 88.470108] systemd[1]: systemd 239 running in system mode. (+PAM +AUDIT
+SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT
+GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2
default-hierarchy=legacy)
[ 88.498398] systemd[1]: Detected architecture arm64.
[ 88.506517] systemd[1]: Running in initial RAM disk.
[ 89.621995] mkdir (1730) used greatest stack depth: 27872 bytes left
[ 90.222658] random: systemd: uninitialized urandom read (16 bytes read)
[ 90.230072] systemd[1]: Reached target Swap.
[ 90.240205] random: systemd: uninitialized urandom read (16 bytes read)
[ 90.251088] systemd[1]: Reached target Timers.
[ 90.261303] random: systemd: uninitialized urandom read (16 bytes read)
[ 90.271209] systemd[1]: Listening on udev Control Socket.
[ 90.283238] systemd[1]: Reached target Local File Systems.
[ 90.296232] systemd[1]: Reached target Slices.
[ 90.307239] systemd[1]: Listening on udev Kernel Socket.
[ 90.608597] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup)
[ 90.678007] kobject_add_internal failed for pgd_cache(13:init.scope)(error: -2 parent: cgroup)
[ 90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles-setup.service) (error: -2 parent: cgroup)
[ 90.820012] systemd-tmpfile (1759) used greatest stack depth: 27184 bytes left
[ 90.861942] kobject_add_internal failed for pgd_cache(13:init.scope) error: -2 parent: cgroup)

> Thanks,
> Mark.
>

[1] https://cailca.github.io/files/dmesg.txt

--
Sincerely yours,
Mike.

2019-06-13 19:46:30

[permalink] [raw]

Subject: Re: [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

On Thu, Jun 13, 2019 at 09:22:36AM -0400, Qian Cai wrote:
> On Thu, 2019-06-13 at 15:11 +0300, Mike Rapoport wrote:
> > The log Qian Cai posted at [1] and partially cited below confirms that the
> > failure happens when *user* PGDs are allocated and the addition of
> > __GFP_ACCOUNT to gfp flags used by pgd_alloc() only uncovered another
> > issue.
> >
> > I'm still failing to reproduce it with qemu and I'm not really familiar
> > with slub/memcg code to say anything smart about it. Will keep looking.
> >
> > Note, that as failures start way after efi_virtmap_init() that allocates a
> > PGD for efi_mm, there are no real fixes required for the original series,
> > except that the check for mm == &init_mm I copied for some reason from
> > powerpc is bogus and can be removed.
>
> Yes, there is more places are not happy with __GFP_ACCOUNT other than efi_mm.
> For example,

Here we allocate the pgd for a user process and it should be accounted.

Actually, the whole point of changing the gfp flags in arm64::pgd_alloc()
was to enable the accounting for memory occupied by user pgds, just like
x86 and powerpc do.

> [??132.786842][ T1501] kobject_add_internal failed for pgd_cache(49:systemd-
> udevd.service) (error: -2 parent: cgroup)
> [??132.795589][ T1889] CPU: 9 PID: 1889 Comm: systemd-udevd Tainted:
> G????????W?????????5.2.0-rc4-next-20190613+ #8
> [??132.807356][ T1889] Hardware name: HPE Apollo
> 70?????????????/C01_APACHE_MB?????????, BIOS L50_5.13_1.0.9 03/01/2019
> [??132.817872][ T1889] Call trace:
> [??132.821017][ T1889]??dump_backtrace+0x0/0x268
> [??132.825372][ T1889]??show_stack+0x20/0x2c
> [??132.829380][ T1889]??dump_stack+0xb4/0x108
> [??132.833475][ T1889]??pgd_alloc+0x34/0x5c
> [??132.837396][ T1889]??mm_init+0x27c/0x32c
> [??132.841315][ T1889]??dup_mm+0x84/0x7b4
> [??132.845061][ T1889]??copy_process+0xf20/0x24cc
> [??132.849500][ T1889]??_do_fork+0xa4/0x66c
> [??132.853420][ T1889]??__arm64_sys_clone+0x114/0x1b4
> [??132.858208][ T1889]??el0_svc_handler+0x198/0x260
> [??132.862821][ T1889]??el0_svc+0x8/0xc
>
> >
> > I surely can add pgd_alloc_kernel() to be used by the EFI code to make sure
> > we won't run into issues with memcg in the future.
> >
> > [???82.125966] Freeing unused kernel memory: 28672K
> > [???87.940365] Checked W+X mappings: passed, no W+X pages found
> > [???87.946769] Run /init as init process
> > [???88.040040] systemd[1]: System time before build time, advancing clock.
> > [???88.054593] systemd[1]: Failed to insert module 'autofs4': No such file or
> > directory
> > [???88.374129] modprobe (1726) used greatest stack depth: 28464 bytes left
> > [???88.470108] systemd[1]: systemd 239 running in system mode. (+PAM +AUDIT
> > +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT
> > +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2
> > default-hierarchy=legacy)
> > [???88.498398] systemd[1]: Detected architecture arm64.
> > [???88.506517] systemd[1]: Running in initial RAM disk.
> > [???89.621995] mkdir (1730) used greatest stack depth: 27872 bytes left
> > [???90.222658] random: systemd: uninitialized urandom read (16 bytes read)
> > [???90.230072] systemd[1]: Reached target Swap.
> > [???90.240205] random: systemd: uninitialized urandom read (16 bytes read)
> > [???90.251088] systemd[1]: Reached target Timers.
> > [???90.261303] random: systemd: uninitialized urandom read (16 bytes read)
> > [???90.271209] systemd[1]: Listening on udev Control Socket.
> > [???90.283238] systemd[1]: Reached target Local File Systems.
> > [???90.296232] systemd[1]: Reached target Slices.
> > [???90.307239] systemd[1]: Listening on udev Kernel Socket.
> > [???90.608597] kobject_add_internal failed for pgd_cache(13:init.scope)
> > (error: -2 parent: cgroup)
> > [???90.678007] kobject_add_internal failed for pgd_cache(13:init.scope)(error:
> > -2 parent: cgroup)
> > [???90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles-
> > setup.service) (error: -2 parent: cgroup)
> > [???90.820012] systemd-tmpfile (1759) used greatest stack depth: 27184 bytes
> > left
> > [???90.861942] kobject_add_internal failed for pgd_cache(13:init.scope) error:
> > -2 parent: cgroup)
> > ?
> > > Thanks,
> > > Mark.
> > >
> >
> > [1] https://cailca.github.io/files/dmesg.txt
> >

--
Sincerely yours,
Mike.

2019-06-17 15:15:12

[permalink] [raw]

Subject: Re: [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

On Thu, Jun 13, 2019 at 03:11:01PM +0300, Mike Rapoport wrote:
> On Tue, Jun 11, 2019 at 11:03:49AM +0100, Mark Rutland wrote:
> > On Mon, Jun 10, 2019 at 01:26:15PM -0400, Qian Cai wrote:
> > > On Mon, 2019-06-10 at 12:43 +0100, Will Deacon wrote:
> > > > On Tue, Jun 04, 2019 at 03:23:38PM +0100, Mark Rutland wrote:
> > > > > On Tue, Jun 04, 2019 at 10:00:36AM -0400, Qian Cai wrote:
> > > > > > The commit "arm64: switch to generic version of pte allocation"
> > > > > > introduced endless failures during boot like,
> > > > > >
> > > > > > kobject_add_internal failed for pgd_cache(285:chronyd.service) (error:
> > > > > > -2 parent: cgroup)
> > > > > >
> > > > > > It turns out __GFP_ACCOUNT is passed to kernel page table allocations
> > > > > > and then later memcg finds out those don't belong to any cgroup.
> > > > >
> > > > > Mike, I understood from [1] that this wasn't expected to be a problem,
> > > > > as the accounting should bypass kernel threads.
> > > > >
> > > > > Was that assumption wrong, or is something different happening here?
> > > > >
> > > > > >
> > > > > > backtrace:
> > > > > > ? kobject_add_internal
> > > > > > ? kobject_init_and_add
> > > > > > ? sysfs_slab_add+0x1a8
> > > > > > ? __kmem_cache_create
> > > > > > ? create_cache
> > > > > > ? memcg_create_kmem_cache
> > > > > > ? memcg_kmem_cache_create_func
> > > > > > ? process_one_work
> > > > > > ? worker_thread
> > > > > > ? kthread
> > > > > >
> > > > > > Signed-off-by: Qian Cai <[email protected]>
> > > > > > ---
> > > > > > ?arch/arm64/mm/pgd.c | 2 +-
> > > > > > ?1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
> > > > > > index 769516cb6677..53c48f5c8765 100644
> > > > > > --- a/arch/arm64/mm/pgd.c
> > > > > > +++ b/arch/arm64/mm/pgd.c
> > > > > > @@ -38,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
> > > > > > ? if (PGD_SIZE == PAGE_SIZE)
> > > > > > ? return (pgd_t *)__get_free_page(gfp);
> > > > > > ? else
> > > > > > - return kmem_cache_alloc(pgd_cache, gfp);
> > > > > > + return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
> > > > >
> > > > > This is used to allocate PGDs for both user and kernel pagetables (e.g.
> > > > > for the efi runtime services), so while this may fix the regression, I'm
> > > > > not sure it's the right fix.
> > > > >
> > > > > Do we need a separate pgd_alloc_kernel()?
> > > >
> > > > So can I take the above for -rc5, or is somebody else working on a different
> > > > fix to implement pgd_alloc_kernel()?
> > >
> > > The offensive commit "arm64: switch to generic version of pte allocation" is not
> > > yet in the mainline, but only in the Andrew's tree and linux-next, and I doubt
> > > Andrew will push this out any time sooner given it is broken.
> >
> > I'd assumed that Mike would respin these patches to implement and use
> > pgd_alloc_kernel() (or take gfp flags) and the updated patches would
> > replace these in akpm's tree.
> >
> > Mike, could you confirm what your plan is? I'm happy to review/test
> > updated patches for arm64.
>
> The log Qian Cai posted at [1] and partially cited below confirms that the
> failure happens when *user* PGDs are allocated and the addition of
> __GFP_ACCOUNT to gfp flags used by pgd_alloc() only uncovered another
> issue.

Indeed the accounting of the PGD memory uncovered a dangling pointer to
pgd_cache :)

The pgd_cache was initialized twice and it made memcg and slub sysfs go
nuts. To be frank, I've got lost in their cross-initialization,
cross-referencing and update sequences, but for sure extra initialization
of pgd_cache was bogus.

I've double checked the 'if (mm == &init_mm)' and it's not needed. The EFI
PGD is allocated before memcg is up and other kernel allocations of pgd (if
we'll have any) would be bypassed by memcg_kmem_bypass().

Andrew, can you please add the patch below as an incremental fix?

With this the arm64::pgd_alloc() should be in the right shape.

From 1c1ef0bc04c655689c6c527bd03b140251399d87 Mon Sep 17 00:00:00 2001
From: Mike Rapoport <[email protected]>
Date: Mon, 17 Jun 2019 17:37:43 +0300
Subject: [PATCH] arm64/mm: don't initialize pgd_cache twice

When PGD_SIZE != PAGE_SIZE, arm64 uses kmem_cache for allocation of PGD
memory. That cache was initialized twice: first through
pgtable_cache_init() alias and then as an override for weak
pgd_cache_init().

After enabling accounting for the PGD memory, this created a confusion for
memcg and slub sysfs code which resulted in the following errors:

[ 90.608597] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup)
[ 90.678007] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup)
[ 90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles-setup.service) (error: -2 parent: cgroup)

Removing the alias from pgtable_cache_init() and keeping the only pgd_cache
initialization in pgd_cache_init() resolves the problem and allows
accounting of PGD memory.

Reported-by: Qian Cai <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 3 +--
arch/arm64/mm/pgd.c | 5 +----
2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 3191b9f..c7a802d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -851,8 +851,7 @@ extern int kern_addr_valid(unsigned long addr);

#include <asm-generic/pgtable.h>

-void pgd_cache_init(void);
-#define pgtable_cache_init pgd_cache_init
+static inline void pgtable_cache_init(void) { }

/*
* On AArch64, the cache coherency is handled via the set_pte_at() function.
diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
index 53c48f5..3f0a744 100644
--- a/arch/arm64/mm/pgd.c
+++ b/arch/arm64/mm/pgd.c
@@ -32,13 +32,10 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
{
gfp_t gfp = GFP_PGTABLE_USER;

- if (unlikely(mm == &init_mm))
- gfp = GFP_PGTABLE_KERNEL;
-
if (PGD_SIZE == PAGE_SIZE)
return (pgd_t *)__get_free_page(gfp);
else
- return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_KERNEL);
+ return kmem_cache_alloc(pgd_cache, gfp);
}

void pgd_free(struct mm_struct *mm, pgd_t *pgd)
--
2.7.4

> [1] https://cailca.github.io/files/dmesg.txt
>
> --
> Sincerely yours,
> Mike.
>

--
Sincerely yours,
Mike.

2019-06-17 16:36:57

by Will Deacon

[permalink] [raw]

Subject: Re: [PATCH -next] arm64/mm: fix a bogus GFP flag in pgd_alloc()

Hi Mike,

On Mon, Jun 17, 2019 at 06:12:52PM +0300, Mike Rapoport wrote:
> Andrew, can you please add the patch below as an incremental fix?
>
> With this the arm64::pgd_alloc() should be in the right shape.
>
>
> From 1c1ef0bc04c655689c6c527bd03b140251399d87 Mon Sep 17 00:00:00 2001
> From: Mike Rapoport <[email protected]>
> Date: Mon, 17 Jun 2019 17:37:43 +0300
> Subject: [PATCH] arm64/mm: don't initialize pgd_cache twice
>
> When PGD_SIZE != PAGE_SIZE, arm64 uses kmem_cache for allocation of PGD
> memory. That cache was initialized twice: first through
> pgtable_cache_init() alias and then as an override for weak
> pgd_cache_init().
>
> After enabling accounting for the PGD memory, this created a confusion for
> memcg and slub sysfs code which resulted in the following errors:
>
> [ 90.608597] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup)
> [ 90.678007] kobject_add_internal failed for pgd_cache(13:init.scope) (error: -2 parent: cgroup)
> [ 90.713260] kobject_add_internal failed for pgd_cache(21:systemd-tmpfiles-setup.service) (error: -2 parent: cgroup)
>
> Removing the alias from pgtable_cache_init() and keeping the only pgd_cache
> initialization in pgd_cache_init() resolves the problem and allows
> accounting of PGD memory.
>
> Reported-by: Qian Cai <[email protected]>
> Signed-off-by: Mike Rapoport <[email protected]>
> ---
> arch/arm64/include/asm/pgtable.h | 3 +--
> arch/arm64/mm/pgd.c | 5 +----
> 2 files changed, 2 insertions(+), 6 deletions(-)

Looks like this actually fixes caa841360134 ("x86/mm: Initialize PGD cache
during mm initialization") due to an unlucky naming conflict!

In which case, I'd actually prefer to take this fix asap via the arm64
tree. Is that ok?

Will

2019-06-18 06:52:38