2019-08-04 22:24:56

by Mikhail Gavrilov

Subject: The issue with page allocation 5.3 rc1-rc2 (seems drm culprit here)

Hi folks,
Two weeks ago, after commit 22051d9c4a57 reached my system, the
following error started to appear at random:
"gnome-shell: page allocation failure: order:4,
mode:0x40cc0(GFP_KERNEL|__GFP_COMP),
nodemask=(null),cpuset=/,mems_allowed=0"
Symptoms:
The screen goes blank as if entering power saving, and the computer
cannot be woken up for several minutes.

I bisected, and the first bad commit appears to be 476e955dd679.
The full bisect logs are here: https://mega.nz/#F!kgYFxAIb!v1tcHANPy2ns1lh4LQLeIg

I reported my findings to the amd-gfx mailing list, but no one answered.
Until yesterday, I thought it was a bug in the amdgpu driver.
But yesterday, after the error occurred again, the system hung
completely, this time with a different error:

[ 3225.317560] BUG: unable to handle page fault for address: 000000000000c9f4
[ 3225.317562] #PF: supervisor read access in kernel mode
[ 3225.317563] #PF: error_code(0x0000) - not-present page
[ 3225.317565] PGD 0 P4D 0
[ 3225.317567] Oops: 0000 [#1] SMP NOPTI
[ 3225.317571] CPU: 2 PID: 12717 Comm: Xorg Tainted: G W
5.3.0-0.rc2.git4.1.fc31.x86_64 #1
[ 3225.317572] Hardware name: System manufacturer System Product
Name/ROG STRIX X470-I GAMING, BIOS 2406 06/21/2019
[ 3225.317625] RIP: 0010:dc_resource_state_copy_construct+0x18/0xf0 [amdgpu]
[ 3225.317627] Code: 00 49 83 c4 01 44 39 e0 7f b5 5b 5d 41 5c 41 5d
c3 c3 0f 1f 44 00 00 41 56 ba f8 c9 00 00 41 55 41 54 49 89 f4 55 4c
89 e5 53 <44> 8b ae f4 c9 00 00 48 89 fe 4c 89 e7 e8 16 86 48 f7 49 8d
84 24
[ 3225.317630] RSP: 0018:ffffb439c3e377d0 EFLAGS: 00010246
[ 3225.317631] RAX: ffff9b0ba19a0000 RBX: ffffffffc08380b0 RCX: 0000000000000006
[ 3225.317633] RDX: 000000000000c9f8 RSI: 0000000000000000 RDI: ffff9b0ab7fc0000
[ 3225.317635] RBP: 0000000000000000 R08: 000002eef3c694b7 R09: 0000000000000000
[ 3225.317636] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 3225.317638] R13: ffff9b0bb5381000 R14: ffff9b09acc68598 R15: ffff9b09acc68540
[ 3225.317640] FS: 00007fdde56cbf00(0000) GS:ffff9b0bba400000(0000)
knlGS:0000000000000000
[ 3225.317641] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3225.317643] CR2: 000000000000c9f4 CR3: 00000007382ee000 CR4: 00000000003406e0
[ 3225.317644] Call Trace:
[ 3225.317714] amdgpu_dm_atomic_commit_tail.cold+0xad/0xe1 [amdgpu]
[ 3225.317719] ? lockdep_hardirqs_on+0xf0/0x180
[ 3225.317723] ? debug_check_no_obj_freed+0x107/0x1d8
[ 3225.317786] ? dm_determine_update_type_for_commit+0x34c/0x420 [amdgpu]
[ 3225.317850] ? dm_determine_update_type_for_commit+0x34c/0x420 [amdgpu]
[ 3225.317855] ? kfree+0x1b6/0x3b0
[ 3225.317918] ? dm_determine_update_type_for_commit+0x34c/0x420 [amdgpu]
[ 3225.317923] ? __lock_acquire+0x247/0x1910
[ 3225.317928] ? find_held_lock+0x32/0x90
[ 3225.317931] ? mark_held_locks+0x50/0x80
[ 3225.317934] ? _raw_spin_unlock_irq+0x29/0x40
[ 3225.317937] ? lockdep_hardirqs_on+0xf0/0x180
[ 3225.317939] ? _raw_spin_unlock_irq+0x29/0x40
[ 3225.317942] ? wait_for_completion_timeout+0x75/0x190
[ 3225.317954] ? commit_tail+0x3c/0x70 [drm_kms_helper]
[ 3225.317960] commit_tail+0x3c/0x70 [drm_kms_helper]
[ 3225.317968] drm_atomic_helper_commit+0xe3/0x150 [drm_kms_helper]
[ 3225.317975] drm_atomic_helper_disable_plane+0x82/0xb0 [drm_kms_helper]
[ 3225.317994] drm_mode_cursor_universal+0x12c/0x240 [drm]
[ 3225.318011] drm_mode_cursor_common+0xd8/0x230 [drm]
[ 3225.318026] ? drm_mode_setplane+0x1a0/0x1a0 [drm]
[ 3225.318038] drm_mode_cursor_ioctl+0x4d/0x70 [drm]
[ 3225.318049] drm_ioctl_kernel+0xaa/0xf0 [drm]
[ 3225.318061] drm_ioctl+0x208/0x390 [drm]
[ 3225.318075] ? drm_mode_setplane+0x1a0/0x1a0 [drm]
[ 3225.318079] ? lockdep_hardirqs_on+0xf0/0x180
[ 3225.318145] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 3225.318164] do_vfs_ioctl+0x411/0x750
[ 3225.318175] ksys_ioctl+0x5e/0x90
[ 3225.318179] __x64_sys_ioctl+0x16/0x20
[ 3225.318188] do_syscall_64+0x5c/0xb0
[ 3225.318191] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 3225.318194] RIP: 0033:0x7fdde5b4007b
[ 3225.318203] Code: 0f 1e fa 48 8b 05 0d 9e 0c 00 64 c7 00 26 00 00
00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 9d 0c 00 f7 d8 64 89
01 48
[ 3225.318209] RSP: 002b:00007ffec481a6d8 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[ 3225.318213] RAX: ffffffffffffffda RBX: 00007ffec481a710 RCX: 00007fdde5b4007b
[ 3225.318215] RDX: 00007ffec481a710 RSI: 00000000c01c64a3 RDI: 000000000000000e
[ 3225.318217] RBP: 00000000c01c64a3 R08: 0000000000000080 R09: 0000000000000000
[ 3225.318218] R10: 0000000000000004 R11: 0000000000000246 R12: 00000000000006f1
[ 3225.318220] R13: 000000000000000e R14: 000056201b5b5490 R15: 000056201bbe7820
[ 3225.318225] Modules linked in: macvtap macvlan tap rfcomm
xt_CHECKSUM xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun bridge stp
llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT
nf_reject_ipv6 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 xt_conntrack
ebtable_nat ip6table_nat ip6table_mangle ip6table_raw
ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw
iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c
ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables
iptable_filter cmac bnep sunrpc vfat fat snd_hda_codec_realtek
edac_mce_amd snd_hda_codec_generic ledtrig_audio kvm_amd rtwpci
snd_hda_codec_hdmi rtw88 kvm snd_hda_intel snd_usb_audio snd_hda_codec
mac80211 snd_hda_core snd_usbmidi_lib irqbypass snd_rawmidi uvcvideo
snd_hwdep snd_seq videobuf2_vmalloc videobuf2_memops btusb
videobuf2_v4l2 crct10dif_pclmul snd_seq_device videobuf2_common btrtl
crc32_pclmul eeepc_wmi snd_pcm btbcm btintel asus_wmi xpad snd_timer
sparse_keymap
[ 3225.318261] videodev ff_memless bluetooth joydev
ghash_clmulni_intel cfg80211 video snd mc k10temp wmi_bmof soundcore
ecdh_generic sp5100_tco ecc rfkill ccp i2c_piix4 libarc4 gpio_amdpt
gpio_generic acpi_cpufreq binfmt_misc ip_tables hid_logitech_hidpp
amdgpu amd_iommu_v2 gpu_sched ttm drm_kms_helper drm igb crc32c_intel
dca i2c_algo_bit hid_logitech_dj nvme nvme_core wmi pinctrl_amd
[ 3225.318283] CR2: 000000000000c9f4
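
Reading the register dump above: RSI is 0 and CR2 is 0xc9f4, which would
be consistent with a 4-byte read of a structure member at offset 0xc9f4
through a NULL pointer, and RDX (0xc9f8) may be the size of that
structure. A minimal userspace sketch of the arithmetic follows. The
names struct model_state, other_fields, tail_member and member_address
are hypothetical stand-ins; the real layout of the driver's structure is
not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout: only the offsets matter here. The padding array
 * stands in for whatever fields the real structure has before the member
 * that was being read when the fault occurred.
 */
struct model_state {
	unsigned char other_fields[0xc9f4];
	int tail_member;	/* assumed 4-byte member at offset 0xc9f4 */
};

/* Address the CPU would try to read for base->tail_member. */
uintptr_t member_address(uintptr_t base)
{
	return base + offsetof(struct model_state, tail_member);
}
```

With a base of 0 (a NULL pointer), member_address() yields 0xc9f4, the
faulting address reported in CR2, and with a 4-byte int the model
structure is 0xc9f8 bytes in total.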

Whenever I see an "SMP NOPTI" error, I suspect that something is wrong
in memory management, so I decided to ask for help on the linux-mm
mailing list. In any case, the AMD developers have not responded to me,
for reasons unknown.

Thanks.

--
Best Regards,
Mike Gavrilov.


Attachments:
dmesg.txt (224.36 kB)

2019-08-05 01:05:04

by Dave Airlie

Subject: Re: The issue with page allocation 5.3 rc1-rc2 (seems drm culprit here)

On Mon, 5 Aug 2019 at 08:23, Mikhail Gavrilov
<[email protected]> wrote:
>
> Hi folks,
> Two weeks ago when commit 22051d9c4a57 coming to my system.
> Started happen randomly errors:
> "gnome-shell: page allocation failure: order:4,
> mode:0x40cc0(GFP_KERNEL|__GFP_COMP),
> nodemask=(null),cpuset=/,mems_allowed=0"
> Symptoms:
> The screen goes out as in energy saving.
> And it is impossible to wake the computer in a few minutes.
>
> I am making bisect and looks like the first bad commit is 476e955dd679.
> Here full bisect logs: https://mega.nz/#F!kgYFxAIb!v1tcHANPy2ns1lh4LQLeIg
>
> I wrote about my find to the amd-gfx mailing list, but no one answer me.
> Until yesterday, I thought it was a bug in the amdgpu driver.
> But yesterday, after the next occurrence of an error, the system hangs
> completely already with another error.

Does it happen if you disable CONFIG_DRM_AMD_DC_DCN2_0? I'm assuming
you don't have a Navi GPU.

I think some struct grew too large in the Navi merge. Hopefully AMD
cares; otherwise we will have to disable Navi before release.
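
As a rough check of the "struct grew too large" theory: assuming 4 KiB
pages, and taking the 0xc9f8 seen in RDX in the reported oops as the
size of the structure being allocated (an assumption, nothing in this
thread confirms it), the allocation needs 13 pages, which rounds up to
16 physically contiguous pages, that is order 4, matching the failure
message. The helper name alloc_order is hypothetical; it only mirrors
the idea behind the kernel's get_order() in plain userspace C.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Smallest order n such that (page_size << n) >= size.
 * An order-n allocation is 2^n physically contiguous pages.
 */
unsigned int alloc_order(size_t size, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	assert(size > 0 && page_size > 0);

	/* Double the span until it covers the requested size. */
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

For example, alloc_order(0xc9f8, 4096) evaluates to 4. Finding 64 KiB
of contiguous free memory gets harder as the system runs and memory
fragments, which would fit the failures appearing at random.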

I've directed this at the main AMD devs who might be helpful.

Dave.

2019-08-05 17:15:54

by Mikhail Gavrilov

Subject: Re: The issue with page allocation 5.3 rc1-rc2 (seems drm culprit here)

On Mon, 5 Aug 2019 at 08:21, Hillf Danton <[email protected]> wrote:
>
>
>
> Try to fix the failure above using vmalloc + kmalloc.
>
> --- a/drivers/gpu/drm/amd/display/dc/core/dc.c
> +++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
> @@ -1174,8 +1174,12 @@ struct dc_state *dc_create_state(struct
> struct dc_state *context = kzalloc(sizeof(struct dc_state),
> GFP_KERNEL);
>
> - if (!context)
> - return NULL;
> + if (!context) {
> + context = kvzalloc(sizeof(struct dc_state),
> + GFP_KERNEL);
> + if (!context)
> + return NULL;
> + }
> /* Each context must have their own instance of VBA and in order to
> * initialize and obtain IP and SOC the base DML instance from DC is
> * initially copied into every context
> @@ -1195,8 +1199,13 @@ struct dc_state *dc_copy_state(struct dc
> struct dc_state *new_ctx = kmemdup(src_ctx,
> sizeof(struct dc_state), GFP_KERNEL);
>
> - if (!new_ctx)
> - return NULL;
> + if (!new_ctx) {
> + new_ctx = kvmalloc(sizeof(*new_ctx), GFP_KERNEL);
> + if (new_ctx)
> + *new_ctx = *src_ctx;
> + else
> + return NULL;
> + }
>
> for (i = 0; i < MAX_PIPES; i++) {
> struct pipe_ctx *cur_pipe = &new_ctx->res_ctx.pipe_ctx[i];
> @@ -1230,7 +1239,7 @@ static void dc_state_free(struct kref *k
> {
> struct dc_state *context = container_of(kref, struct dc_state, refcount);
> dc_resource_state_destruct(context);
> - kfree(context);
> + kvfree(context);
> }
>
> void dc_release_state(struct dc_state *context)
> --

Unfortunately, I could not test this patch because the kernel did not
compile with it applied.
Here are the compile error messages:

drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c: In function
'dc_create_state':
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:1178:13: error:
implicit declaration of function 'kvzalloc'; did you mean 'kzalloc'?
[-Werror=implicit-function-declaration]
1178 | context = kvzalloc(sizeof(struct dc_state),
| ^~~~~~~~
| kzalloc
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:1178:11: warning:
assignment to 'struct dc_state *' from 'int' makes pointer from
integer without a cast [-Wint-conversion]
1178 | context = kvzalloc(sizeof(struct dc_state),
| ^
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c: In function 'dc_copy_state':
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:1203:13: error:
implicit declaration of function 'kvmalloc'; did you mean 'kmalloc'?
[-Werror=implicit-function-declaration]
1203 | new_ctx = kvmalloc(sizeof(*new_ctx), GFP_KERNEL);
| ^~~~~~~~
| kmalloc
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:1203:11: warning:
assignment to 'struct dc_state *' from 'int' makes pointer from
integer without a cast [-Wint-conversion]
1203 | new_ctx = kvmalloc(sizeof(*new_ctx), GFP_KERNEL);
| ^
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c: In function 'dc_state_free':
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:1242:2: error:
implicit declaration of function 'kvfree'; did you mean 'kzfree'?
[-Werror=implicit-function-declaration]
1242 | kvfree(context);
| ^~~~~~
| kzfree
cc1: some warnings being treated as errors
make[4]: *** [scripts/Makefile.build:274:
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.o] Error 1
make[4]: *** Waiting for unfinished jobs....
make[3]: *** [scripts/Makefile.build:490: drivers/gpu/drm/amd/amdgpu] Error 2
make[3]: *** Waiting for unfinished jobs....
make: *** [Makefile:1084: drivers] Error 2


--
Best Regards,
Mike Gavrilov.

2019-08-08 05:35:18

by Alex Deucher

Subject: Re: The issue with page allocation 5.3 rc1-rc2 (seems drm culprit here)

On Wed, Aug 7, 2019 at 11:49 PM Mikhail Gavrilov
<[email protected]> wrote:
>
> On Tue, 6 Aug 2019 at 06:48, Hillf Danton <[email protected]> wrote:
> >
> > My bad, respin with one header file added.
> >
> > Hillf
> > -----8<---
> >
> > --- a/drivers/gpu/drm/amd/display/dc/core/dc.c
> > +++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
> > @@ -23,6 +23,7 @@
> > */
> >
> > #include <linux/slab.h>
> > +#include <linux/mm.h>
> >
> > #include "dm_services.h"
> >
> > @@ -1174,8 +1175,12 @@ struct dc_state *dc_create_state(struct
> > struct dc_state *context = kzalloc(sizeof(struct dc_state),
> > GFP_KERNEL);
> >
> > - if (!context)
> > - return NULL;
> > + if (!context) {
> > + context = kvzalloc(sizeof(struct dc_state),
> > + GFP_KERNEL);
> > + if (!context)
> > + return NULL;
> > + }
> > /* Each context must have their own instance of VBA and in order to
> > * initialize and obtain IP and SOC the base DML instance from DC is
> > * initially copied into every context
> > @@ -1195,8 +1200,13 @@ struct dc_state *dc_copy_state(struct dc
> > struct dc_state *new_ctx = kmemdup(src_ctx,
> > sizeof(struct dc_state), GFP_KERNEL);
> >
> > - if (!new_ctx)
> > - return NULL;
> > + if (!new_ctx) {
> > + new_ctx = kvmalloc(sizeof(*new_ctx), GFP_KERNEL);
> > + if (new_ctx)
> > + *new_ctx = *src_ctx;
> > + else
> > + return NULL;
> > + }
> >
> > for (i = 0; i < MAX_PIPES; i++) {
> > struct pipe_ctx *cur_pipe = &new_ctx->res_ctx.pipe_ctx[i];
> > @@ -1230,7 +1240,7 @@ static void dc_state_free(struct kref *k
> > {
> > struct dc_state *context = container_of(kref, struct dc_state, refcount);
> > dc_resource_state_destruct(context);
> > - kfree(context);
> > + kvfree(context);
> > }
> >
> > void dc_release_state(struct dc_state *context)
> > --
> >
>
> Unfortunately error "gnome-shell: page allocation failure: order:4,
> mode:0x40cc0(GFP_KERNEL|__GFP_COMP),
> nodemask=(null),cpuset=/,mems_allowed=0" still happens even with
> applying this patch.

I think we can just drop the kmalloc altogether. How about this patch?

Alex

>
> Thanks.
>
>
> --
> Best Regards,
> Mike Gavrilov.
> _______________________________________________
> amd-gfx mailing list
> [email protected]
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Attachments:
0001-drm-amd-display-use-kvmalloc-for-dc_state.patch (1.57 kB)

2019-08-08 08:23:23

by Michel Dänzer

Subject: Re: The issue with page allocation 5.3 rc1-rc2 (seems drm culprit here)

On 2019-08-08 7:31 a.m., Alex Deucher wrote:
> On Wed, Aug 7, 2019 at 11:49 PM Mikhail Gavrilov
> <[email protected]> wrote:
>>
>> Unfortunately error "gnome-shell: page allocation failure: order:4,
>> mode:0x40cc0(GFP_KERNEL|__GFP_COMP),
>> nodemask=(null),cpuset=/,mems_allowed=0" still happens even with
>> applying this patch.
>
> I think we can just drop the kmalloc altogether. How about this patch?

Memory allocated by kvz/malloc needs to be freed with kvfree.
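
To illustrate why the pairing matters, here is a small userspace model,
not kernel code. All model_* names, the header tag and the 32 KiB limit
are hypothetical. In the kernel, kvfree() checks is_vmalloc_addr() to
decide how to release the memory, while kfree() assumes a slab or page
allocation; the model replaces that check with an origin tag stored in
front of each allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Pretend physically contiguous allocations above this size fail. */
#define MODEL_CONTIG_LIMIT 32768u

enum model_origin { MODEL_CONTIG, MODEL_VIRT };

/* Header kept in front of each allocation; the union preserves alignment. */
union model_hdr {
	enum model_origin origin;
	max_align_t align;
};

static void *model_alloc(size_t size, enum model_origin origin)
{
	union model_hdr *h = calloc(1, sizeof(*h) + size);

	if (!h)
		return NULL;
	h->origin = origin;
	return h + 1;
}

enum model_origin model_origin_of(const void *p)
{
	return ((const union model_hdr *)p - 1)->origin;
}

/* Stand-in for kzalloc(): contiguous only, fails for large sizes. */
void *model_kzalloc(size_t size)
{
	if (size > MODEL_CONTIG_LIMIT)
		return NULL;
	return model_alloc(size, MODEL_CONTIG);
}

/* Stand-in for kvzalloc(): try contiguous first, then fall back. */
void *model_kvzalloc(size_t size)
{
	void *p = model_kzalloc(size);

	return p ? p : model_alloc(size, MODEL_VIRT);
}

/* Stand-in for kfree(): returns -1 when handed a fallback allocation. */
int model_kfree(void *p)
{
	if (!p)
		return 0;
	if (model_origin_of(p) != MODEL_CONTIG)
		return -1;	/* in the kernel this mismatch would be a bug */
	free((union model_hdr *)p - 1);
	return 0;
}

/* Stand-in for kvfree(): releases memory from either origin. */
void model_kvfree(void *p)
{
	if (p)
		free((union model_hdr *)p - 1);
}
```

In this model a 0xc9f8-byte request to model_kvzalloc() comes from the
fallback path, so model_kfree() rejects it and only model_kvfree()
releases it correctly.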


--
Earthling Michel Dänzer | https://www.amd.com
Libre software enthusiast | Mesa and X developer

2019-08-08 14:30:53

by Alex Deucher

Subject: Re: The issue with page allocation 5.3 rc1-rc2 (seems drm culprit here)

On Thu, Aug 8, 2019 at 4:13 AM Michel Dänzer <[email protected]> wrote:
>
> On 2019-08-08 7:31 a.m., Alex Deucher wrote:
> > On Wed, Aug 7, 2019 at 11:49 PM Mikhail Gavrilov
> > <[email protected]> wrote:
> >>
> >> Unfortunately error "gnome-shell: page allocation failure: order:4,
> >> mode:0x40cc0(GFP_KERNEL|__GFP_COMP),
> >> nodemask=(null),cpuset=/,mems_allowed=0" still happens even with
> >> applying this patch.
> >
> > I think we can just drop the kmalloc altogether. How about this patch?
>
> Memory allocated by kvz/malloc needs to be freed with kvfree.
>

Yup, good catch. Updated patch attached.

Alex


Attachments:
0001-drm-amd-display-use-kvmalloc-for-dc_state-v2.patch (1.87 kB)

2019-08-10 16:09:50

by Mikhail Gavrilov

Subject: Re: The issue with page allocation 5.3 rc1-rc2 (seems drm culprit here)

On Fri, 9 Aug 2019 at 23:55, Mikhail Gavrilov
<[email protected]> wrote:
> Finally initial problem "gnome-shell: page allocation failure:
> order:4, mode:0x40cc0(GFP_KERNEL|__GFP_COMP),
> nodemask=(null),cpuset=/,mems_allowed=0" did not happens anymore with
> latest version of the patch (I tested more than 23 hours)
>
> But I hit a new problem:
>
> [73808.088801] ------------[ cut here ]------------
> [73808.088806] DEBUG_LOCKS_WARN_ON(ww_ctx->contending_lock)
> [73808.088813] WARNING: CPU: 8 PID: 1348877 at
> kernel/locking/mutex.c:757 __ww_mutex_lock.constprop.0+0xb0f/0x10c0

[pruned]

> So I needed to report it separately (in another thread) or we continue here?

Today, after a reboot, the "DEBUG_LOCKS_WARN_ON(ww_ctx->contending_lock)"
issue happened again.

--
Best Regards,
Mike Gavrilov.


Attachments:
dmesg2.txt (5.74 kB)