2019-11-18 21:42:00

by Qian Cai

[permalink] [raw]
Subject: PSI: use-after-free in collect_percpu_times()

Since a few days ago, s390 starts to crash on linux-next while reading some
sysfs. It is not always reproducible but seems pretty reproducible after running
the whole MM test suite here,
https://github.com/cailca/linux-mm/blob/master/test.sh

the config:
https://raw.githubusercontent.com/cailca/linux-mm/master/s390.config

The stack trace on s390 is not particular helpful as both gdb and faddr2line are
unable to point out which line causes the issue.

# ./scripts/faddr2line vmlinux collect_percpu_times+0x2d6/0x798
bad symbol size: base: 0x00000000002076f8 end: 0x00000000002076f8

(gdb) list *(collect_percpu_times+0x2d6)
0x2079ce is in collect_percpu_times (./include/linux/compiler.h:199).
194 })
195
196 static __always_inline
197 void __read_once_size(const volatile void *p, void *res, int size)
198 {
199 __READ_ONCE_SIZE;
200 }
201
202 #ifdef CONFIG_KASAN
203 /*

Could it be some race conditions in PSI?

01: [ 9499.451788] LTP: starting read_all_sys (read_all -d /sys -q -r 10)       
(zcon 17:55:45) [-- MARK --]
00: [ 9777.709470] =============================================================
00: =====                                                                       
00: [ 9777.709515] BUG: KASAN: use-after-free in collect_percpu_times+0x2d6/0x79
00: 8                                                                           
00: [ 9777.709530] Read of size 4 at addr 0000000066229000 by task read_all/5024
00: 0                                                                           
00: [ 9777.709543]                                                              
00: [ 9777.709559] CPU: 0 PID: 50240 Comm: read_all Not tainted 5.4.0-rc7-next-2
00: 0191118 #4                                                                  
00: [ 9777.709573] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)                 
00: [ 9777.709586] Call Trace:                                                  
00: [ 9777.709604] ([<000000000568e202>] show_stack+0x11a/0x1b8)                
00: [ 9777.709624]  [<0000000005f7568e>] dump_stack+0x136/0x188                 
00: [ 9777.709641]  [<0000000005a16594>] print_address_description.isra.8+0x64/0
00: x3f0                                                                        
00: [ 9777.709657]  [<0000000005a16b6a>] __kasan_report+0x162/0x188             
00: [ 9777.709672]  [<00000000057779ce>] collect_percpu_times+0x2d6/0x798       
00: [ 9777.709687]  [<000000000577a10c>] psi_show+0x84/0x2b0                    
00: [ 9777.709703]  [<0000000005abf18c>] seq_read+0x2b4/0x838                   
00: [ 9777.709718]  [<0000000005a75502>] vfs_read+0x9a/0x158                    
00: [ 9777.709732]  [<0000000005a759ea>] ksys_read+0xea/0x190                   
00: [ 9777.709748]  [<0000000005fb3100>] system_call+0xd8/0x2b4                 
00: [ 9777.709759] INFO: lockdep is turned off.                                 
00: [ 9777.709769]                                                              
00: [ 9777.709778] The buggy address belongs to the page:                       
00: [ 9777.709793] page:000003d081988a40 refcount:0 mapcount:-2048 mapping:00000
00: 00000000000 index:0x0                                                       
00: [ 9777.709813] raw: 003fffc000000000 000003d081988a48 000003d081988a48 00000
00: 00000000000                                                                 
00: [ 9777.709831] raw: 0000000000000000 0000000000000000 fffff7ff00000000 00000
00: 00000000000                                                                 
00: [ 9777.710042] page dumped because: kasan: bad access detected              
00: [ 9777.710057] page_owner tracks the page as freed                          
00: [ 9777.710075] page last allocated via order 2, migratetype Unmovable, gfp_m
00: ask 0x52cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP)               
00: [ 9777.710145]  stack_trace_save+0xc2/0xd8                                  
00: [ 9777.710158]  register_dummy_stack+0x94/0xc0                              
00: [ 9777.710170]  init_page_owner+0x5c/0x518                                  
00: [ 9777.710188]  kernel_init_freeable+0x2a4/0x5a8                            
00: [ 9777.710199] page last free stack trace:                                  
00: [ 9777.710212]  stack_trace_save+0xc2/0xd8                                  
00: [ 9777.710224]  register_dummy_stack+0x94/0xc0                              
00: [ 9777.710236]  init_page_owner+0x5c/0x518                                  
00: [ 9777.710249]  kernel_init_freeable+0x2a4/0x5a8                            
00: [ 9777.710259]                                                              
00: [ 9777.710267] Memory state around the buggy address:                       
00: [ 9777.710282]  0000000066228f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 
00: fc fc                                                                       
00: [ 9777.710297]  0000000066228f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 
00: fc fc                                                                       
00: [ 9777.710312] >0000000066229000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
00: ff ff                                                                       
00: [ 9777.710325]                    ^                                         
00: [ 9777.710337]  0000000066229080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
00: ff ff                                                                       
00: [ 9777.710352]  0000000066229100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
00: ff ff                                                                       
00: [ 9777.710366] =============================================================
00: =====                                                                       
00: [ 9777.710389] Unable to handle kernel pointer dereference in virtual kernel
00:  address space                                                              
00: [ 9777.710580] Failing address: 0000000066229000 TEID: 0000000066229803     
00: [ 9777.710595] Fault in home space mode while using kernel ASCE.            
00: [ 9777.710613] AS:0000000007084007 R3:000000006dc28007 S:000000006d8fa800 P:
00: 000000006622943d                                                            
00: [ 9777.710700] Oops: 0011 ilc:2 [#1] SMP DEBUG_PAGEALLOC                    
00: [ 9777.710719] Modules linked in: brd ext4 crc16 mbcache jbd2 loop ip_tables
00:  x_tables xfs dasd_fba_mod dasd_eckd_mod dm_mirror dm_region_hash dm_log dm_
00: mod                                                                         
00: [ 9777.710778] CPU: 0 PID: 50240 Comm: read_all Tainted: G    B             
00: 5.4.0-rc7-next-20191118 #4                                                  
00: [ 9777.710794] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)                 
00: [ 9777.710806] Krnl PSW : 0704c00180000000 00000000057779ce (collect_percpu_
00: times+0x2d6/0x798)                                                          
00: [ 9777.710832]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0
00:  RI:0 EA:3                                                                  
00: [ 9777.710849] Krnl GPRS: 0000000000000001 000000004a8b2100 00000000063315ac
00:  0000000000000007                                                           
00: [ 9777.710865]            0000030000000000 000000000578e0be 0000000000000001
00:  000000003409fc70                                                           
00: [ 9777.710881]            000000000577a10c 070000003409fb00 0000000066229008
00:  0000000066229000                                                           
00: [ 9777.710896]            000000000639f6b0 0000000005ff5d30 00000000057779ce
00:  000000003409fa20                                                           
00: [ 9777.710939] Krnl Code: 00000000057779c0: ad031000            stosm   0(%r
00: 1),3                                                                        
00: [ 9777.710939]            00000000057779c4: b904002b            lgr     %r2,
00: %r11                                                                        
00: [ 9777.710939]           #00000000057779c8: c0e50014fc9c        brasl   %r14
00: ,0000000005a17300                                                           
00: [ 9777.710939]           >00000000057779ce: 5880b000            l       %r8,
00: 0(%r11)                                                                     
00: [ 9777.710939]            00000000057779d2: a7810001            tmll    %r8,
00: 1                                                                           
00: [ 9777.710939]            00000000057779d6: a784ff97            brc     8,00
00: 00000005777904                                                              
00: [ 9777.710939]            00000000057779da: a7f4ff8a            brc     15,0
00: 0000000057778ee                                                             
00: [ 9777.710939]            00000000057779de: e310f1300004        lg      %r1,
00: 304(%r15)                                                                   
00: [ 9777.711135] Call Trace:                                                  
00: [ 9777.711149] ([<00000000057779ce>] collect_percpu_times+0x2d6/0x798)      
00: [ 9777.711164]  [<000000000577a10c>] psi_show+0x84/0x2b0                    
00: [ 9777.711180]  [<0000000005abf18c>] seq_read+0x2b4/0x838                   
00: [ 9777.711194]  [<0000000005a75502>] vfs_read+0x9a/0x158                    
00: [ 9777.711207]  [<0000000005a759ea>] ksys_read+0xea/0x190                   
00: [ 9777.711222]  [<0000000005fb3100>] system_call+0xd8/0x2b4                 
00: [ 9777.711234] INFO: lockdep is turned off.                                 
00: [ 9777.711244] Last Breaking-Event-Address:                                 
00: [ 9777.711257]  [<0000000005a16a70>] __kasan_report+0x68/0x188              
00: [ 9777.711273] Kernel panic - not syncing: Fatal exception: panic_on_oops


2019-11-18 22:00:48

by Qian Cai

[permalink] [raw]
Subject: Re: PSI: use-after-free in collect_percpu_times()

Wondering if this is cgroup v2 related, as the systems is running with
systemd.unified_cgroup_hierarchy=1 and accessing PSI via sysfs trigger this...

On Mon, 2019-11-18 at 16:39 -0500, Qian Cai wrote:
> Since a few days ago, s390 starts to crash on linux-next while reading some
> sysfs. It is not always reproducible but seems pretty reproducible after running
> the whole MM test suite here,
> https://github.com/cailca/linux-mm/blob/master/test.sh
>
> the config:
> https://raw.githubusercontent.com/cailca/linux-mm/master/s390.config
>
> The stack trace on s390 is not particular helpful as both gdb and faddr2line are
> unable to point out which line causes the issue.
>
> # ./scripts/faddr2line vmlinux collect_percpu_times+0x2d6/0x798
> bad symbol size: base: 0x00000000002076f8 end: 0x00000000002076f8
>
> (gdb) list *(collect_percpu_times+0x2d6)
> 0x2079ce is in collect_percpu_times (./include/linux/compiler.h:199).
> 194 })
> 195
> 196 static __always_inline
> 197 void __read_once_size(const volatile void *p, void *res, int size)
> 198 {
> 199 __READ_ONCE_SIZE;
> 200 }
> 201
> 202 #ifdef CONFIG_KASAN
> 203 /*
>
> Could it be some race conditions in PSI?
>
> 01: [ 9499.451788] LTP: starting read_all_sys (read_all -d /sys -q -r 10)       
> (zcon 17:55:45) [-- MARK --]
> 00: [ 9777.709470] =============================================================
> 00: =====                                                                       
> 00: [ 9777.709515] BUG: KASAN: use-after-free in collect_percpu_times+0x2d6/0x79
> 00: 8                                                                           
> 00: [ 9777.709530] Read of size 4 at addr 0000000066229000 by task read_all/5024
> 00: 0                                                                           
> 00: [ 9777.709543]                                                              
> 00: [ 9777.709559] CPU: 0 PID: 50240 Comm: read_all Not tainted 5.4.0-rc7-next-2
> 00: 0191118 #4                                                                  
> 00: [ 9777.709573] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)                 
> 00: [ 9777.709586] Call Trace:                                                  
> 00: [ 9777.709604] ([<000000000568e202>] show_stack+0x11a/0x1b8)                
> 00: [ 9777.709624]  [<0000000005f7568e>] dump_stack+0x136/0x188                 
> 00: [ 9777.709641]  [<0000000005a16594>] print_address_description.isra.8+0x64/0
> 00: x3f0                                                                        
> 00: [ 9777.709657]  [<0000000005a16b6a>] __kasan_report+0x162/0x188             
> 00: [ 9777.709672]  [<00000000057779ce>] collect_percpu_times+0x2d6/0x798       
> 00: [ 9777.709687]  [<000000000577a10c>] psi_show+0x84/0x2b0                    
> 00: [ 9777.709703]  [<0000000005abf18c>] seq_read+0x2b4/0x838                   
> 00: [ 9777.709718]  [<0000000005a75502>] vfs_read+0x9a/0x158                    
> 00: [ 9777.709732]  [<0000000005a759ea>] ksys_read+0xea/0x190                   
> 00: [ 9777.709748]  [<0000000005fb3100>] system_call+0xd8/0x2b4                 
> 00: [ 9777.709759] INFO: lockdep is turned off.                                 
> 00: [ 9777.709769]                                                              
> 00: [ 9777.709778] The buggy address belongs to the page:                       
> 00: [ 9777.709793] page:000003d081988a40 refcount:0 mapcount:-2048 mapping:00000
> 00: 00000000000 index:0x0                                                       
> 00: [ 9777.709813] raw: 003fffc000000000 000003d081988a48 000003d081988a48 00000
> 00: 00000000000                                                                 
> 00: [ 9777.709831] raw: 0000000000000000 0000000000000000 fffff7ff00000000 00000
> 00: 00000000000                                                                 
> 00: [ 9777.710042] page dumped because: kasan: bad access detected              
> 00: [ 9777.710057] page_owner tracks the page as freed                          
> 00: [ 9777.710075] page last allocated via order 2, migratetype Unmovable, gfp_m
> 00: ask 0x52cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP)               
> 00: [ 9777.710145]  stack_trace_save+0xc2/0xd8                                  
> 00: [ 9777.710158]  register_dummy_stack+0x94/0xc0                              
> 00: [ 9777.710170]  init_page_owner+0x5c/0x518                                  
> 00: [ 9777.710188]  kernel_init_freeable+0x2a4/0x5a8                            
> 00: [ 9777.710199] page last free stack trace:                                  
> 00: [ 9777.710212]  stack_trace_save+0xc2/0xd8                                  
> 00: [ 9777.710224]  register_dummy_stack+0x94/0xc0                              
> 00: [ 9777.710236]  init_page_owner+0x5c/0x518                                  
> 00: [ 9777.710249]  kernel_init_freeable+0x2a4/0x5a8                            
> 00: [ 9777.710259]                                                              
> 00: [ 9777.710267] Memory state around the buggy address:                       
> 00: [ 9777.710282]  0000000066228f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 
> 00: fc fc                                                                       
> 00: [ 9777.710297]  0000000066228f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 
> 00: fc fc                                                                       
> 00: [ 9777.710312] >0000000066229000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
> 00: ff ff                                                                       
> 00: [ 9777.710325]                    ^                                         
> 00: [ 9777.710337]  0000000066229080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
> 00: ff ff                                                                       
> 00: [ 9777.710352]  0000000066229100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
> 00: ff ff                                                                       
> 00: [ 9777.710366] =============================================================
> 00: =====                                                                       
> 00: [ 9777.710389] Unable to handle kernel pointer dereference in virtual kernel
> 00:  address space                                                              
> 00: [ 9777.710580] Failing address: 0000000066229000 TEID: 0000000066229803     
> 00: [ 9777.710595] Fault in home space mode while using kernel ASCE.            
> 00: [ 9777.710613] AS:0000000007084007 R3:000000006dc28007 S:000000006d8fa800 P:
> 00: 000000006622943d                                                            
> 00: [ 9777.710700] Oops: 0011 ilc:2 [#1] SMP DEBUG_PAGEALLOC                    
> 00: [ 9777.710719] Modules linked in: brd ext4 crc16 mbcache jbd2 loop ip_tables
> 00:  x_tables xfs dasd_fba_mod dasd_eckd_mod dm_mirror dm_region_hash dm_log dm_
> 00: mod                                                                         
> 00: [ 9777.710778] CPU: 0 PID: 50240 Comm: read_all Tainted: G    B             
> 00: 5.4.0-rc7-next-20191118 #4                                                  
> 00: [ 9777.710794] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)                 
> 00: [ 9777.710806] Krnl PSW : 0704c00180000000 00000000057779ce (collect_percpu_
> 00: times+0x2d6/0x798)                                                          
> 00: [ 9777.710832]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0
> 00:  RI:0 EA:3                                                                  
> 00: [ 9777.710849] Krnl GPRS: 0000000000000001 000000004a8b2100 00000000063315ac
> 00:  0000000000000007                                                           
> 00: [ 9777.710865]            0000030000000000 000000000578e0be 0000000000000001
> 00:  000000003409fc70                                                           
> 00: [ 9777.710881]            000000000577a10c 070000003409fb00 0000000066229008
> 00:  0000000066229000                                                           
> 00: [ 9777.710896]            000000000639f6b0 0000000005ff5d30 00000000057779ce
> 00:  000000003409fa20                                                           
> 00: [ 9777.710939] Krnl Code: 00000000057779c0: ad031000            stosm   0(%r
> 00: 1),3                                                                        
> 00: [ 9777.710939]            00000000057779c4: b904002b            lgr     %r2,
> 00: %r11                                                                        
> 00: [ 9777.710939]           #00000000057779c8: c0e50014fc9c        brasl   %r14
> 00: ,0000000005a17300                                                           
> 00: [ 9777.710939]           >00000000057779ce: 5880b000            l       %r8,
> 00: 0(%r11)                                                                     
> 00: [ 9777.710939]            00000000057779d2: a7810001            tmll    %r8,
> 00: 1                                                                           
> 00: [ 9777.710939]            00000000057779d6: a784ff97            brc     8,00
> 00: 00000005777904                                                              
> 00: [ 9777.710939]            00000000057779da: a7f4ff8a            brc     15,0
> 00: 0000000057778ee                                                             
> 00: [ 9777.710939]            00000000057779de: e310f1300004        lg      %r1,
> 00: 304(%r15)                                                                   
> 00: [ 9777.711135] Call Trace:                                                  
> 00: [ 9777.711149] ([<00000000057779ce>] collect_percpu_times+0x2d6/0x798)      
> 00: [ 9777.711164]  [<000000000577a10c>] psi_show+0x84/0x2b0                    
> 00: [ 9777.711180]  [<0000000005abf18c>] seq_read+0x2b4/0x838                   
> 00: [ 9777.711194]  [<0000000005a75502>] vfs_read+0x9a/0x158                    
> 00: [ 9777.711207]  [<0000000005a759ea>] ksys_read+0xea/0x190                   
> 00: [ 9777.711222]  [<0000000005fb3100>] system_call+0xd8/0x2b4                 
> 00: [ 9777.711234] INFO: lockdep is turned off.                                 
> 00: [ 9777.711244] Last Breaking-Event-Address:                                 
> 00: [ 9777.711257]  [<0000000005a16a70>] __kasan_report+0x68/0x188              
> 00: [ 9777.711273] Kernel panic - not syncing: Fatal exception: panic_on_oops

2019-11-18 22:03:10

by Johannes Weiner

[permalink] [raw]
Subject: Re: PSI: use-after-free in collect_percpu_times()

Hi Qian,

On Mon, Nov 18, 2019 at 04:39:19PM -0500, Qian Cai wrote:
> Since a few days ago, s390 starts to crash on linux-next while reading some
> sysfs. It is not always reproducible but seems pretty reproducible after running
> the whole MM test suite here,
> https://github.com/cailca/linux-mm/blob/master/test.sh
>
> the config:
> https://raw.githubusercontent.com/cailca/linux-mm/master/s390.config
>
> The stack trace on s390 is not particular helpful as both gdb and faddr2line are
> unable to point out which line causes the issue.
>
> # ./scripts/faddr2line vmlinux collect_percpu_times+0x2d6/0x798
> bad symbol size: base: 0x00000000002076f8 end: 0x00000000002076f8
>
> (gdb) list *(collect_percpu_times+0x2d6)
> 0x2079ce is in collect_percpu_times (./include/linux/compiler.h:199).
> 194 })
> 195
> 196 static __always_inline
> 197 void __read_once_size(const volatile void *p, void *res, int size)
> 198 {
> 199 __READ_ONCE_SIZE;
> 200 }
> 201
> 202 #ifdef CONFIG_KASAN
> 203 /*
>
> Could it be some race conditions in PSI?

psi doesn't do much lifetime management in itself: the psi_group is
embedded in the cgroup and the per-cpu data is freed right before the
cgroup itself is freed. An open file descriptor on the pressure files
will pin the cgroup and prevent it from being deleted.

As it's reproducible, would you be able to bisect this problem?

2019-11-18 22:08:08

by Qian Cai

[permalink] [raw]
Subject: Re: PSI: use-after-free in collect_percpu_times()

On Mon, 2019-11-18 at 17:00 -0500, Johannes Weiner wrote:
> Hi Qian,
>
> On Mon, Nov 18, 2019 at 04:39:19PM -0500, Qian Cai wrote:
> > Since a few days ago, s390 starts to crash on linux-next while reading some
> > sysfs. It is not always reproducible but seems pretty reproducible after running
> > the whole MM test suite here,
> > https://github.com/cailca/linux-mm/blob/master/test.sh
> >
> > the config:
> > https://raw.githubusercontent.com/cailca/linux-mm/master/s390.config
> >
> > The stack trace on s390 is not particular helpful as both gdb and faddr2line are
> > unable to point out which line causes the issue.
> >
> > # ./scripts/faddr2line vmlinux collect_percpu_times+0x2d6/0x798
> > bad symbol size: base: 0x00000000002076f8 end: 0x00000000002076f8
> >
> > (gdb) list *(collect_percpu_times+0x2d6)
> > 0x2079ce is in collect_percpu_times (./include/linux/compiler.h:199).
> > 194 })
> > 195
> > 196 static __always_inline
> > 197 void __read_once_size(const volatile void *p, void *res, int size)
> > 198 {
> > 199 __READ_ONCE_SIZE;
> > 200 }
> > 201
> > 202 #ifdef CONFIG_KASAN
> > 203 /*
> >
> > Could it be some race conditions in PSI?
>
> psi doesn't do much lifetime management in itself: the psi_group is
> embedded in the cgroup and the per-cpu data is freed right before the
> cgroup itself is freed. An open file descriptor on the pressure files
> will pin the cgroup and prevent it from being deleted.
>
> As it's reproducible, would you be able to bisect this problem?

Yes, it is going to time-consuming though as I have not found a quick reproducer
yet.