2014-02-11 17:27:18

by Dave Jones

[permalink] [raw]
Subject: 3.14-rc2 XFS backtrace because irqs_disabled.

BUG: sleeping function called from invalid context at mm/mempool.c:203
in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3
5 locks held by trinity-c3/27511:
#0: (sb_writers#9){......}, at: [<ffffffff831df3b4>] mnt_want_write+0x24/0x50
#1: (&type->i_mutex_dir_key#3){......}, at: [<ffffffff831cced4>] do_last.isra.51+0x294/0x11f0
#2: (sb_internal#2){......}, at: [<ffffffffc03225c4>] xfs_trans_alloc+0x24/0x40 [xfs]
#3: (&(&ip->i_lock)->mr_lock/1){......}, at: [<ffffffffc036447f>] xfs_ilock+0x16f/0x1b0 [xfs]
#4: (&(&ip->i_lock)->mr_lock){......}, at: [<ffffffffc03646d4>] xfs_ilock_nowait+0x184/0x200 [xfs]
CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111
ffffffff83a3fd56 0000000045a3849a ffff88009f368f60 ffffffff8372afea
0000000000000000 ffff88009f368f88 ffffffff8309ddb5 0000000000000010
ffff880243566288 0000000000000008 ffff88009f369008 ffffffff831534d3
Call Trace:
[<ffffffff8372afea>] dump_stack+0x4e/0x7a
[<ffffffff8309ddb5>] __might_sleep+0x105/0x150
[<ffffffff831534d3>] mempool_alloc+0xa3/0x170
[<ffffffff831a955a>] ? deactivate_slab+0x51a/0x590
[<ffffffff831f8fd6>] bio_alloc_bioset+0x156/0x210
[<ffffffffc0308231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
[<ffffffffc03798f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
[<ffffffffc030849b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
[<ffffffffc03798f2>] xlog_bdstrat+0x22/0x60 [xfs]
[<ffffffffc037ba87>] xlog_sync+0x3a7/0x5b0 [xfs]
[<ffffffffc037bd9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
[<ffffffffc037c840>] xlog_write+0x6f0/0x800 [xfs]
[<ffffffffc037e061>] xlog_cil_push+0x2f1/0x410 [xfs]
[<ffffffffc037e948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
[<ffffffffc03428a8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
[<ffffffffc037cc40>] _xfs_log_force+0x70/0x290 [xfs]
[<ffffffff830a4edd>] ? get_parent_ip+0xd/0x50
[<ffffffffc037ce86>] xfs_log_force+0x26/0xb0 [xfs]
[<ffffffffc03076e6>] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
[<ffffffffc03074e3>] xfs_buf_lock+0x133/0x140 [xfs]
[<ffffffffc03076e6>] _xfs_buf_find+0x1f6/0x3c0 [xfs]
[<ffffffffc030796a>] xfs_buf_get_map+0x2a/0x1b0 [xfs]
[<ffffffffc0383c61>] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
[<ffffffffc034e03d>] xfs_da_get_buf+0xbd/0x100 [xfs]
[<ffffffffc0356869>] xfs_dir3_data_init+0x59/0x1d0 [xfs]
[<ffffffffc03526db>] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
[<ffffffffc0354ede>] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
[<ffffffffc0351f5a>] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
[<ffffffffc035e9b4>] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
[<ffffffffc034c24f>] ? xfs_da_compname+0x1f/0x30 [xfs]
[<ffffffffc035f503>] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
[<ffffffffc035f858>] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
[<ffffffffc031970d>] ? xfs_setup_inode+0x1cd/0x320 [xfs]
[<ffffffffc03529a4>] xfs_dir_createname+0x184/0x1e0 [xfs]
[<ffffffffc03663d9>] xfs_create+0x469/0x580 [xfs]
[<ffffffffc0318864>] xfs_vn_mknod+0xc4/0x1e0 [xfs]
[<ffffffffc0318993>] xfs_vn_create+0x13/0x20 [xfs]
[<ffffffff831ccc15>] vfs_create+0x95/0xc0
[<ffffffff831cd638>] do_last.isra.51+0x9f8/0x11f0
[<ffffffff831c9e11>] ? link_path_walk+0x81/0x870
[<ffffffff831cdef9>] path_openat+0xc9/0x620
[<ffffffff8331cc02>] ? put_dec+0x72/0x90
[<ffffffff831ceb6d>] do_filp_open+0x4d/0xb0
[<ffffffff831bc4ae>] file_open_name+0xfe/0x160
[<ffffffff831bc554>] filp_open+0x44/0x60
[<ffffffff83221862>] do_coredump+0x602/0xf60
[<ffffffff83081358>] get_signal_to_deliver+0x2b8/0x6b0
[<ffffffff830024d7>] do_signal+0x57/0x9d0
[<ffffffff8311115e>] ? __acct_update_integrals+0x8e/0x120
[<ffffffff83730000>] ? __schedule+0x60/0x850
[<ffffffff8373a1db>] ? preempt_count_sub+0x6b/0xf0
[<ffffffff83735a31>] ? _raw_spin_unlock+0x31/0x50
[<ffffffff830ab671>] ? vtime_account_user+0x91/0xa0
[<ffffffff8314f57b>] ? context_tracking_user_exit+0x9b/0x100
[<ffffffff83002eac>] do_notify_resume+0x5c/0xa0
[<ffffffff83736746>] retint_signal+0x46/0x90
------------[ cut here ]------------
WARNING: CPU: 3 PID: 27511 at block/blk.h:227 generic_make_request_checks+0x33f/0x460()
Modules linked in: bridge stp tun snd_seq_dummy fuse hidp bnep rfcomm nf_conntrack_netlink nf_conntrack scsi_transport_iscsi nfnetlink can_raw nfc pn_pep caif_socket caif af_802154 phonet af_rxrpc can_bcm can llc2 pppoe pppox ppp_generic slhc irda crc_ccitt rds ipt_ULOG af_key rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 xfs libcrc32c coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm crct10dif_pclmul crc32c_intel ghash_clmulni_intel snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic serio_raw microcode pcspkr snd_hda_intel snd_hda_codec usb_debug btusb snd_hwdep snd_seq snd_seq_device bluetooth 6lowpan_iphc rfkill snd_pcm snd_timer snd soundcore e1000e ptp shpchp pps_core
CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111
0000000000000009 0000000045a3849a ffff88009f368f10 ffffffff8372afea
0000000000000000 ffff88009f368f48 ffffffff8306d0cd ffff88023933ed00
00000000ffffffff 00000000197f7510 0000000000800005 ffff880240913740
Call Trace:
[<ffffffff8372afea>] dump_stack+0x4e/0x7a
[<ffffffff8306d0cd>] warn_slowpath_common+0x7d/0xa0
[<ffffffff8306d1fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff832e81bf>] generic_make_request_checks+0x33f/0x460
[<ffffffff832e8307>] generic_make_request+0x27/0x130
[<ffffffff832e8488>] submit_bio+0x78/0x160
[<ffffffff831f9000>] ? bio_alloc_bioset+0x180/0x210
[<ffffffffc030832b>] _xfs_buf_ioapply+0x2bb/0x3c0 [xfs]
[<ffffffffc03798f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
[<ffffffffc030849b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
[<ffffffffc03798f2>] xlog_bdstrat+0x22/0x60 [xfs]
[<ffffffffc037ba87>] xlog_sync+0x3a7/0x5b0 [xfs]
[<ffffffffc037bd9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
[<ffffffffc037c840>] xlog_write+0x6f0/0x800 [xfs]
[<ffffffffc037e061>] xlog_cil_push+0x2f1/0x410 [xfs]
[<ffffffffc037e948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
[<ffffffffc03428a8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
[<ffffffffc037cc40>] _xfs_log_force+0x70/0x290 [xfs]
[<ffffffff830a4edd>] ? get_parent_ip+0xd/0x50
[<ffffffffc037ce86>] xfs_log_force+0x26/0xb0 [xfs]
[<ffffffffc03076e6>] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
[<ffffffffc03074e3>] xfs_buf_lock+0x133/0x140 [xfs]
[<ffffffffc03076e6>] _xfs_buf_find+0x1f6/0x3c0 [xfs]
[<ffffffffc030796a>] xfs_buf_get_map+0x2a/0x1b0 [xfs]
[<ffffffffc0383c61>] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
[<ffffffffc034e03d>] xfs_da_get_buf+0xbd/0x100 [xfs]
[<ffffffffc0356869>] xfs_dir3_data_init+0x59/0x1d0 [xfs]
[<ffffffffc03526db>] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
[<ffffffffc0354ede>] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
[<ffffffffc0351f5a>] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
[<ffffffffc035e9b4>] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
[<ffffffffc034c24f>] ? xfs_da_compname+0x1f/0x30 [xfs]
[<ffffffffc035f503>] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
[<ffffffffc035f858>] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
[<ffffffffc031970d>] ? xfs_setup_inode+0x1cd/0x320 [xfs]
[<ffffffffc03529a4>] xfs_dir_createname+0x184/0x1e0 [xfs]
[<ffffffffc03663d9>] xfs_create+0x469/0x580 [xfs]
[<ffffffffc0318864>] xfs_vn_mknod+0xc4/0x1e0 [xfs]
[<ffffffffc0318993>] xfs_vn_create+0x13/0x20 [xfs]
[<ffffffff831ccc15>] vfs_create+0x95/0xc0
[<ffffffff831cd638>] do_last.isra.51+0x9f8/0x11f0
[<ffffffff831c9e11>] ? link_path_walk+0x81/0x870
[<ffffffff831cdef9>] path_openat+0xc9/0x620
[<ffffffff8331cc02>] ? put_dec+0x72/0x90
[<ffffffff831ceb6d>] do_filp_open+0x4d/0xb0
[<ffffffff831bc4ae>] file_open_name+0xfe/0x160
[<ffffffff831bc554>] filp_open+0x44/0x60
[<ffffffff83221862>] do_coredump+0x602/0xf60
[<ffffffff83081358>] get_signal_to_deliver+0x2b8/0x6b0
[<ffffffff830024d7>] do_signal+0x57/0x9d0
[<ffffffff8311115e>] ? __acct_update_integrals+0x8e/0x120
[<ffffffff83730000>] ? __schedule+0x60/0x850
[<ffffffff8373a1db>] ? preempt_count_sub+0x6b/0xf0
[<ffffffff83735a31>] ? _raw_spin_unlock+0x31/0x50
[<ffffffff830ab671>] ? vtime_account_user+0x91/0xa0
[<ffffffff8314f57b>] ? context_tracking_user_exit+0x9b/0x100
[<ffffffff83002eac>] do_notify_resume+0x5c/0xa0
[<ffffffff83736746>] retint_signal+0x46/0x90
---[ end trace e6e48f6461c98e81 ]---


2014-02-11 21:08:49

by Dave Chinner

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Tue, Feb 11, 2014 at 12:27:07PM -0500, Dave Jones wrote:
> BUG: sleeping function called from invalid context at mm/mempool.c:203
> in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3
> 5 locks held by trinity-c3/27511:
> #0: (sb_writers#9){......}, at: [<ffffffff831df3b4>] mnt_want_write+0x24/0x50
> #1: (&type->i_mutex_dir_key#3){......}, at: [<ffffffff831cced4>] do_last.isra.51+0x294/0x11f0
> #2: (sb_internal#2){......}, at: [<ffffffffc03225c4>] xfs_trans_alloc+0x24/0x40 [xfs]
> #3: (&(&ip->i_lock)->mr_lock/1){......}, at: [<ffffffffc036447f>] xfs_ilock+0x16f/0x1b0 [xfs]
> #4: (&(&ip->i_lock)->mr_lock){......}, at: [<ffffffffc03646d4>] xfs_ilock_nowait+0x184/0x200 [xfs]
> CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111
> ffffffff83a3fd56 0000000045a3849a ffff88009f368f60 ffffffff8372afea
> 0000000000000000 ffff88009f368f88 ffffffff8309ddb5 0000000000000010
> ffff880243566288 0000000000000008 ffff88009f369008 ffffffff831534d3
> Call Trace:
> [<ffffffff8372afea>] dump_stack+0x4e/0x7a
> [<ffffffff8309ddb5>] __might_sleep+0x105/0x150
> [<ffffffff831534d3>] mempool_alloc+0xa3/0x170
> [<ffffffff831a955a>] ? deactivate_slab+0x51a/0x590
> [<ffffffff831f8fd6>] bio_alloc_bioset+0x156/0x210
> [<ffffffffc0308231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
> [<ffffffffc03798f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
> [<ffffffffc030849b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
> [<ffffffffc03798f2>] xlog_bdstrat+0x22/0x60 [xfs]
> [<ffffffffc037ba87>] xlog_sync+0x3a7/0x5b0 [xfs]
> [<ffffffffc037bd9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
> [<ffffffffc037c840>] xlog_write+0x6f0/0x800 [xfs]
> [<ffffffffc037e061>] xlog_cil_push+0x2f1/0x410 [xfs]
> [<ffffffffc037e948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
> [<ffffffffc03428a8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
> [<ffffffffc037cc40>] _xfs_log_force+0x70/0x290 [xfs]
> [<ffffffff830a4edd>] ? get_parent_ip+0xd/0x50
> [<ffffffffc037ce86>] xfs_log_force+0x26/0xb0 [xfs]
> [<ffffffffc03076e6>] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
> [<ffffffffc03074e3>] xfs_buf_lock+0x133/0x140 [xfs]
> [<ffffffffc03076e6>] _xfs_buf_find+0x1f6/0x3c0 [xfs]
> [<ffffffffc030796a>] xfs_buf_get_map+0x2a/0x1b0 [xfs]
> [<ffffffffc0383c61>] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
> [<ffffffffc034e03d>] xfs_da_get_buf+0xbd/0x100 [xfs]
> [<ffffffffc0356869>] xfs_dir3_data_init+0x59/0x1d0 [xfs]
> [<ffffffffc03526db>] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
> [<ffffffffc0354ede>] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
> [<ffffffffc0351f5a>] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
> [<ffffffffc035e9b4>] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
> [<ffffffffc034c24f>] ? xfs_da_compname+0x1f/0x30 [xfs]
> [<ffffffffc035f503>] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
> [<ffffffffc035f858>] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
> [<ffffffffc031970d>] ? xfs_setup_inode+0x1cd/0x320 [xfs]
> [<ffffffffc03529a4>] xfs_dir_createname+0x184/0x1e0 [xfs]
> [<ffffffffc03663d9>] xfs_create+0x469/0x580 [xfs]
> [<ffffffffc0318864>] xfs_vn_mknod+0xc4/0x1e0 [xfs]
> [<ffffffffc0318993>] xfs_vn_create+0x13/0x20 [xfs]
> [<ffffffff831ccc15>] vfs_create+0x95/0xc0
> [<ffffffff831cd638>] do_last.isra.51+0x9f8/0x11f0
> [<ffffffff831c9e11>] ? link_path_walk+0x81/0x870
> [<ffffffff831cdef9>] path_openat+0xc9/0x620
> [<ffffffff8331cc02>] ? put_dec+0x72/0x90
> [<ffffffff831ceb6d>] do_filp_open+0x4d/0xb0
> [<ffffffff831bc4ae>] file_open_name+0xfe/0x160
> [<ffffffff831bc554>] filp_open+0x44/0x60
> [<ffffffff83221862>] do_coredump+0x602/0xf60
> [<ffffffff83081358>] get_signal_to_deliver+0x2b8/0x6b0
> [<ffffffff830024d7>] do_signal+0x57/0x9d0
> [<ffffffff8311115e>] ? __acct_update_integrals+0x8e/0x120
> [<ffffffff83730000>] ? __schedule+0x60/0x850
> [<ffffffff8373a1db>] ? preempt_count_sub+0x6b/0xf0
> [<ffffffff83735a31>] ? _raw_spin_unlock+0x31/0x50
> [<ffffffff830ab671>] ? vtime_account_user+0x91/0xa0
> [<ffffffff8314f57b>] ? context_tracking_user_exit+0x9b/0x100
> [<ffffffff83002eac>] do_notify_resume+0x5c/0xa0
> [<ffffffff83736746>] retint_signal+0x46/0x90

There's nowhere in this XFS stack that disables interrupts, so I
thinks it's either above or below XFS that is causing this problem.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-02-11 21:49:17

by Eric Sandeen

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On 2/11/14, 3:08 PM, Dave Chinner wrote:
> On Tue, Feb 11, 2014 at 12:27:07PM -0500, Dave Jones wrote:
>> BUG: sleeping function called from invalid context at mm/mempool.c:203
>> in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3
>> 5 locks held by trinity-c3/27511:
>> #0: (sb_writers#9){......}, at: [<ffffffff831df3b4>] mnt_want_write+0x24/0x50
>> #1: (&type->i_mutex_dir_key#3){......}, at: [<ffffffff831cced4>] do_last.isra.51+0x294/0x11f0
>> #2: (sb_internal#2){......}, at: [<ffffffffc03225c4>] xfs_trans_alloc+0x24/0x40 [xfs]
>> #3: (&(&ip->i_lock)->mr_lock/1){......}, at: [<ffffffffc036447f>] xfs_ilock+0x16f/0x1b0 [xfs]
>> #4: (&(&ip->i_lock)->mr_lock){......}, at: [<ffffffffc03646d4>] xfs_ilock_nowait+0x184/0x200 [xfs]
>> CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111
>> ffffffff83a3fd56 0000000045a3849a ffff88009f368f60 ffffffff8372afea
>> 0000000000000000 ffff88009f368f88 ffffffff8309ddb5 0000000000000010
>> ffff880243566288 0000000000000008 ffff88009f369008 ffffffff831534d3
>> Call Trace:
>> [<ffffffff8372afea>] dump_stack+0x4e/0x7a
>> [<ffffffff8309ddb5>] __might_sleep+0x105/0x150
>> [<ffffffff831534d3>] mempool_alloc+0xa3/0x170
>> [<ffffffff831a955a>] ? deactivate_slab+0x51a/0x590
>> [<ffffffff831f8fd6>] bio_alloc_bioset+0x156/0x210
>> [<ffffffffc0308231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
>> [<ffffffffc03798f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
>> [<ffffffffc030849b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
>> [<ffffffffc03798f2>] xlog_bdstrat+0x22/0x60 [xfs]
>> [<ffffffffc037ba87>] xlog_sync+0x3a7/0x5b0 [xfs]
>> [<ffffffffc037bd9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
>> [<ffffffffc037c840>] xlog_write+0x6f0/0x800 [xfs]
>> [<ffffffffc037e061>] xlog_cil_push+0x2f1/0x410 [xfs]
>> [<ffffffffc037e948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
>> [<ffffffffc03428a8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
>> [<ffffffffc037cc40>] _xfs_log_force+0x70/0x290 [xfs]
>> [<ffffffff830a4edd>] ? get_parent_ip+0xd/0x50
>> [<ffffffffc037ce86>] xfs_log_force+0x26/0xb0 [xfs]
>> [<ffffffffc03076e6>] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
>> [<ffffffffc03074e3>] xfs_buf_lock+0x133/0x140 [xfs]
>> [<ffffffffc03076e6>] _xfs_buf_find+0x1f6/0x3c0 [xfs]
>> [<ffffffffc030796a>] xfs_buf_get_map+0x2a/0x1b0 [xfs]
>> [<ffffffffc0383c61>] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
>> [<ffffffffc034e03d>] xfs_da_get_buf+0xbd/0x100 [xfs]
>> [<ffffffffc0356869>] xfs_dir3_data_init+0x59/0x1d0 [xfs]
>> [<ffffffffc03526db>] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
>> [<ffffffffc0354ede>] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
>> [<ffffffffc0351f5a>] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
>> [<ffffffffc035e9b4>] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
>> [<ffffffffc034c24f>] ? xfs_da_compname+0x1f/0x30 [xfs]
>> [<ffffffffc035f503>] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
>> [<ffffffffc035f858>] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
>> [<ffffffffc031970d>] ? xfs_setup_inode+0x1cd/0x320 [xfs]
>> [<ffffffffc03529a4>] xfs_dir_createname+0x184/0x1e0 [xfs]
>> [<ffffffffc03663d9>] xfs_create+0x469/0x580 [xfs]
>> [<ffffffffc0318864>] xfs_vn_mknod+0xc4/0x1e0 [xfs]
>> [<ffffffffc0318993>] xfs_vn_create+0x13/0x20 [xfs]
>> [<ffffffff831ccc15>] vfs_create+0x95/0xc0
>> [<ffffffff831cd638>] do_last.isra.51+0x9f8/0x11f0
>> [<ffffffff831c9e11>] ? link_path_walk+0x81/0x870
>> [<ffffffff831cdef9>] path_openat+0xc9/0x620
>> [<ffffffff8331cc02>] ? put_dec+0x72/0x90
>> [<ffffffff831ceb6d>] do_filp_open+0x4d/0xb0
>> [<ffffffff831bc4ae>] file_open_name+0xfe/0x160
>> [<ffffffff831bc554>] filp_open+0x44/0x60
>> [<ffffffff83221862>] do_coredump+0x602/0xf60
>> [<ffffffff83081358>] get_signal_to_deliver+0x2b8/0x6b0
>> [<ffffffff830024d7>] do_signal+0x57/0x9d0
>> [<ffffffff8311115e>] ? __acct_update_integrals+0x8e/0x120
>> [<ffffffff83730000>] ? __schedule+0x60/0x850
>> [<ffffffff8373a1db>] ? preempt_count_sub+0x6b/0xf0
>> [<ffffffff83735a31>] ? _raw_spin_unlock+0x31/0x50
>> [<ffffffff830ab671>] ? vtime_account_user+0x91/0xa0
>> [<ffffffff8314f57b>] ? context_tracking_user_exit+0x9b/0x100
>> [<ffffffff83002eac>] do_notify_resume+0x5c/0xa0
>> [<ffffffff83736746>] retint_signal+0x46/0x90
>
> There's nowhere in this XFS stack that disables interrupts, so I
> thinks it's either above or below XFS that is causing this problem.

any chance the stack got blown & corrupted things such that it
only _looks_ like irqs are disabled?

-Eric

> Cheers,
>
> Dave.
>

2014-02-12 00:44:17

by Dave Jones

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Tue, Feb 11, 2014 at 03:49:14PM -0600, Eric Sandeen wrote:
> On 2/11/14, 3:08 PM, Dave Chinner wrote:
> > On Tue, Feb 11, 2014 at 12:27:07PM -0500, Dave Jones wrote:
> >> BUG: sleeping function called from invalid context at mm/mempool.c:203
> >> in_atomic(): 0, irqs_disabled(): 1, pid: 27511, name: trinity-c3
> >> 5 locks held by trinity-c3/27511:
> >> #0: (sb_writers#9){......}, at: [<ffffffff831df3b4>] mnt_want_write+0x24/0x50
> >> #1: (&type->i_mutex_dir_key#3){......}, at: [<ffffffff831cced4>] do_last.isra.51+0x294/0x11f0
> >> #2: (sb_internal#2){......}, at: [<ffffffffc03225c4>] xfs_trans_alloc+0x24/0x40 [xfs]
> >> #3: (&(&ip->i_lock)->mr_lock/1){......}, at: [<ffffffffc036447f>] xfs_ilock+0x16f/0x1b0 [xfs]
> >> #4: (&(&ip->i_lock)->mr_lock){......}, at: [<ffffffffc03646d4>] xfs_ilock_nowait+0x184/0x200 [xfs]
> >> CPU: 3 PID: 27511 Comm: trinity-c3 Not tainted 3.14.0-rc2+ #111
> >> ffffffff83a3fd56 0000000045a3849a ffff88009f368f60 ffffffff8372afea
> >> 0000000000000000 ffff88009f368f88 ffffffff8309ddb5 0000000000000010
> >> ffff880243566288 0000000000000008 ffff88009f369008 ffffffff831534d3
> >> Call Trace:
> >> [<ffffffff8372afea>] dump_stack+0x4e/0x7a
> >> [<ffffffff8309ddb5>] __might_sleep+0x105/0x150
> >> [<ffffffff831534d3>] mempool_alloc+0xa3/0x170
> >> [<ffffffff831a955a>] ? deactivate_slab+0x51a/0x590
> >> [<ffffffff831f8fd6>] bio_alloc_bioset+0x156/0x210
> >> [<ffffffffc0308231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
> >> [<ffffffffc03798f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
> >> [<ffffffffc030849b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
> >> [<ffffffffc03798f2>] xlog_bdstrat+0x22/0x60 [xfs]
> >> [<ffffffffc037ba87>] xlog_sync+0x3a7/0x5b0 [xfs]
> >> [<ffffffffc037bd9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
> >> [<ffffffffc037c840>] xlog_write+0x6f0/0x800 [xfs]
> >> [<ffffffffc037e061>] xlog_cil_push+0x2f1/0x410 [xfs]
> >> [<ffffffffc037e948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
> >> [<ffffffffc03428a8>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
> >> [<ffffffffc037cc40>] _xfs_log_force+0x70/0x290 [xfs]
> >> [<ffffffff830a4edd>] ? get_parent_ip+0xd/0x50
> >> [<ffffffffc037ce86>] xfs_log_force+0x26/0xb0 [xfs]
> >> [<ffffffffc03076e6>] ? _xfs_buf_find+0x1f6/0x3c0 [xfs]
> >> [<ffffffffc03074e3>] xfs_buf_lock+0x133/0x140 [xfs]
> >> [<ffffffffc03076e6>] _xfs_buf_find+0x1f6/0x3c0 [xfs]
> >> [<ffffffffc030796a>] xfs_buf_get_map+0x2a/0x1b0 [xfs]
> >> [<ffffffffc0383c61>] xfs_trans_get_buf_map+0x1a1/0x240 [xfs]
> >> [<ffffffffc034e03d>] xfs_da_get_buf+0xbd/0x100 [xfs]
> >> [<ffffffffc0356869>] xfs_dir3_data_init+0x59/0x1d0 [xfs]
> >> [<ffffffffc03526db>] ? xfs_dir2_grow_inode+0x13b/0x150 [xfs]
> >> [<ffffffffc0354ede>] xfs_dir2_sf_to_block+0x17e/0x7b0 [xfs]
> >> [<ffffffffc0351f5a>] ? xfs_dir2_sfe_get_ino+0x1a/0x20 [xfs]
> >> [<ffffffffc035e9b4>] ? xfs_dir2_sf_check.isra.7+0x114/0x1b0 [xfs]
> >> [<ffffffffc034c24f>] ? xfs_da_compname+0x1f/0x30 [xfs]
> >> [<ffffffffc035f503>] ? xfs_dir2_sf_lookup+0x303/0x310 [xfs]
> >> [<ffffffffc035f858>] xfs_dir2_sf_addname+0x348/0x6d0 [xfs]
> >> [<ffffffffc031970d>] ? xfs_setup_inode+0x1cd/0x320 [xfs]
> >> [<ffffffffc03529a4>] xfs_dir_createname+0x184/0x1e0 [xfs]
> >> [<ffffffffc03663d9>] xfs_create+0x469/0x580 [xfs]
> >> [<ffffffffc0318864>] xfs_vn_mknod+0xc4/0x1e0 [xfs]
> >> [<ffffffffc0318993>] xfs_vn_create+0x13/0x20 [xfs]
> >> [<ffffffff831ccc15>] vfs_create+0x95/0xc0
> >> [<ffffffff831cd638>] do_last.isra.51+0x9f8/0x11f0
> >> [<ffffffff831c9e11>] ? link_path_walk+0x81/0x870
> >> [<ffffffff831cdef9>] path_openat+0xc9/0x620
> >> [<ffffffff8331cc02>] ? put_dec+0x72/0x90
> >> [<ffffffff831ceb6d>] do_filp_open+0x4d/0xb0
> >> [<ffffffff831bc4ae>] file_open_name+0xfe/0x160
> >> [<ffffffff831bc554>] filp_open+0x44/0x60
> >> [<ffffffff83221862>] do_coredump+0x602/0xf60
> >> [<ffffffff83081358>] get_signal_to_deliver+0x2b8/0x6b0
> >> [<ffffffff830024d7>] do_signal+0x57/0x9d0
> >> [<ffffffff8311115e>] ? __acct_update_integrals+0x8e/0x120
> >> [<ffffffff83730000>] ? __schedule+0x60/0x850
> >> [<ffffffff8373a1db>] ? preempt_count_sub+0x6b/0xf0
> >> [<ffffffff83735a31>] ? _raw_spin_unlock+0x31/0x50
> >> [<ffffffff830ab671>] ? vtime_account_user+0x91/0xa0
> >> [<ffffffff8314f57b>] ? context_tracking_user_exit+0x9b/0x100
> >> [<ffffffff83002eac>] do_notify_resume+0x5c/0xa0
> >> [<ffffffff83736746>] retint_signal+0x46/0x90
> >
> > There's nowhere in this XFS stack that disables interrupts, so I
> > thinks it's either above or below XFS that is causing this problem.
>
> any chance the stack got blown & corrupted things such that it
> only _looks_ like irqs are disabled?

The 'good' news is I seem to be able to reproduce it fairly quickly.
I've seen it 3-4 times this afternoon. (Always from the coredump path)

I was on vacation last week, but I wasn't hitting this before then, so it's a
fairly recent change that's introduced this.

Al ?

Dave

2014-02-12 01:09:47

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Tue, Feb 11, 2014 at 07:44:03PM -0500, Dave Jones wrote:

> The 'good' news is I seem to be able to reproduce it fairly quickly.
> I've seen it 3-4 times this afternoon. (Always from the coredump path)
>
> I was on vacation last week, but I wasn't hitting this before then, so it's a
> fairly recent change that's introduced this.

Slap the check in vfs_create(), see if interrupts had been disabled by it or
by something in ->create(). Since it's reproducible...

2014-02-12 02:52:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Tue, Feb 11, 2014 at 5:09 PM, Al Viro <[email protected]> wrote:
>
> Slap the check in vfs_create(), see if interrupts had been disabled by it or
> by something in ->create(). Since it's reproducible...

path_openat() starts off with a get_empty_filp(), which allocates a
file pointer with GFP_KERNEL. So that should have triggered the
might_sleep warning if irq's were already disabled at that point.

So it's not before that - in particular, it's not in the signal
handling or do_coredump() paths.

Also, at least xfs_buf_lock() - which is much deeper in that chain -
does a down(&bp->b_sema). I'm disguested that that doesn't have a
might_sleep() in it.

Dave, mind adding a "might_sleep()" to the top of
"down[_interruptible]()". It's silly to not have coverage of semaphore
use in bad contexts.

Linus

2014-02-12 04:04:14

by Dave Jones

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Tue, Feb 11, 2014 at 06:52:19PM -0800, Linus Torvalds wrote:
> On Tue, Feb 11, 2014 at 5:09 PM, Al Viro <[email protected]> wrote:
> >
> > Slap the check in vfs_create(), see if interrupts had been disabled by it or
> > by something in ->create(). Since it's reproducible...
>
> path_openat() starts off with a get_empty_filp(), which allocates a
> file pointer with GFP_KERNEL. So that should have triggered the
> might_sleep warning if irq's were already disabled at that point.
>
> So it's not before that - in particular, it's not in the signal
> handling or do_coredump() paths.
>
> Also, at least xfs_buf_lock() - which is much deeper in that chain -
> does a down(&bp->b_sema). I'm disguested that that doesn't have a
> might_sleep() in it.
>
> Dave, mind adding a "might_sleep()" to the top of
> "down[_interruptible]()". It's silly to not have coverage of semaphore
> use in bad contexts.

Added those, didn't trigger. Neither did Al's suggestion.

Slightly different trace, but still from the coredump path.

Dave

[ 3111.403822] BUG: sleeping function called from invalid context at mm/mempool.c:203
[ 3111.404414] in_atomic(): 0, irqs_disabled(): 1, pid: 19213, name: trinity-c46
[ 3111.404884] 5 locks held by trinity-c46/19213:
[ 3111.405354] #0: (sb_writers#9){......}, at: [<ffffffff8d2220c5>] do_coredump+0xe05/0xf60
[ 3111.405862] #1: (shrinker_rwsem){......}, at: [<ffffffff8d1644af>] shrink_slab+0x3f/0x180
[ 3111.406374] #2: (&type->s_umount_key#30){......}, at: [<ffffffff8d1c0db4>] grab_super_passive+0x44/0x90
[ 3111.406905] #3: (&pag->pag_ici_reclaim_lock){......}, at: [<ffffffffc031eb1a>] xfs_reclaim_inodes_ag+0x31a/0x430 [xfs]
[ 3111.407466] #4: (&(&ip->i_lock)->mr_lock){......}, at: [<ffffffffc037047f>] xfs_ilock+0x16f/0x1b0 [xfs]
[ 3111.408039] CPU: 0 PID: 19213 Comm: trinity-c46 Not tainted 3.14.0-rc2+ #113
[ 3111.409779] ffffffff8da3fde6 000000004f998871 ffff88023715cce0 ffffffff8d72b091
[ 3111.410396] 0000000000000000 ffff88023715cd08 ffffffff8d09ddb5 0000000000000010
[ 3111.411021] ffff880243566288 0000000000000008 ffff88023715cd88 ffffffff8d1534f3
[ 3111.411656] Call Trace:
[ 3111.412283] [<ffffffff8d72b091>] dump_stack+0x4e/0x7a
[ 3111.412922] [<ffffffff8d09ddb5>] __might_sleep+0x105/0x150
[ 3111.413562] [<ffffffff8d1534f3>] mempool_alloc+0xa3/0x170
[ 3111.414202] [<ffffffff8d1f9036>] bio_alloc_bioset+0x156/0x210
[ 3111.414855] [<ffffffffc0314231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
[ 3111.415517] [<ffffffffc03858f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
[ 3111.416175] [<ffffffffc031449b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
[ 3111.416843] [<ffffffffc03858f2>] xlog_bdstrat+0x22/0x60 [xfs]
[ 3111.417509] [<ffffffffc0387a87>] xlog_sync+0x3a7/0x5b0 [xfs]
[ 3111.418175] [<ffffffffc0387d9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
[ 3111.418846] [<ffffffffc0388840>] xlog_write+0x6f0/0x800 [xfs]
[ 3111.419518] [<ffffffffc038a061>] xlog_cil_push+0x2f1/0x410 [xfs]
[ 3111.420195] [<ffffffffc038a948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
[ 3111.420865] [<ffffffff8d1a6cd9>] ? __bit_spin_unlock.constprop.66+0x19/0x40
[ 3111.421551] [<ffffffffc0389413>] _xfs_log_force_lsn+0x93/0x340 [xfs]
[ 3111.422239] [<ffffffffc03896ee>] xfs_log_force_lsn+0x2e/0xb0 [xfs]
[ 3111.422932] [<ffffffffc0374209>] ? xfs_iunpin_wait+0x19/0x20 [xfs]
[ 3111.423625] [<ffffffffc0370b50>] __xfs_iunpin_wait+0xd0/0x1a0 [xfs]
[ 3111.424310] [<ffffffff8d0b9dc0>] ? autoremove_wake_function+0x40/0x40
[ 3111.425008] [<ffffffffc0374209>] xfs_iunpin_wait+0x19/0x20 [xfs]
[ 3111.425705] [<ffffffffc031e50c>] xfs_reclaim_inode+0x8c/0x380 [xfs]
[ 3111.426405] [<ffffffffc031ea7b>] xfs_reclaim_inodes_ag+0x27b/0x430 [xfs]
[ 3111.427104] [<ffffffffc031e900>] ? xfs_reclaim_inodes_ag+0x100/0x430 [xfs]
[ 3111.427797] [<ffffffffc031ed13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[ 3111.428481] [<ffffffffc032afa5>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
[ 3111.429150] [<ffffffff8d1c107c>] super_cache_scan+0x16c/0x180
[ 3111.429824] [<ffffffff8d16225b>] shrink_slab_node+0x14b/0x2e0
[ 3111.430489] [<ffffffff8d1644af>] ? shrink_slab+0x3f/0x180
[ 3111.431146] [<ffffffff8d1644fe>] shrink_slab+0x8e/0x180
[ 3111.431796] [<ffffffff8d1677c6>] try_to_free_pages+0x516/0x970
[ 3111.432436] [<ffffffff8d1f2c57>] ? __set_page_dirty+0x27/0xc0
[ 3111.433065] [<ffffffff8d15af19>] __alloc_pages_nodemask+0x7a9/0xb00
[ 3111.433689] [<ffffffff8d19e806>] alloc_pages_current+0x106/0x1f0
[ 3111.434304] [<ffffffff8d040cb7>] ? pte_alloc_one+0x17/0x80
[ 3111.434911] [<ffffffff8d040cb7>] pte_alloc_one+0x17/0x80
[ 3111.435510] [<ffffffff8d17b4a7>] __pte_alloc+0x27/0x130
[ 3111.436098] [<ffffffff8d17effc>] handle_mm_fault+0xafc/0xbb0
[ 3111.436681] [<ffffffff8d17f27e>] __get_user_pages+0x1ce/0x620
[ 3111.437254] [<ffffffff8d17f984>] get_dump_page+0x54/0x80
[ 3111.437810] [<ffffffff8d219da9>] elf_core_dump+0x12d9/0x1420
[ 3111.438356] [<ffffffff8d219330>] ? elf_core_dump+0x860/0x1420
[ 3111.438884] [<ffffffff8d221ec2>] do_coredump+0xc02/0xf60
[ 3111.439398] [<ffffffff8d081358>] get_signal_to_deliver+0x2b8/0x6b0
[ 3111.439898] [<ffffffff8d0024d7>] do_signal+0x57/0x9d0
[ 3111.440386] [<ffffffff8d11117e>] ? __acct_update_integrals+0x8e/0x120
[ 3111.440873] [<ffffffff8d73a29b>] ? preempt_count_sub+0x6b/0xf0
[ 3111.441357] [<ffffffff8d735ae1>] ? _raw_spin_unlock+0x31/0x50
[ 3111.441833] [<ffffffff8d0ab671>] ? vtime_account_user+0x91/0xa0
[ 3111.442307] [<ffffffff8d14f59b>] ? context_tracking_user_exit+0x9b/0x100
[ 3111.442783] [<ffffffff8d002eac>] do_notify_resume+0x5c/0xa0
[ 3111.443260] [<ffffffff8d736806>] retint_signal+0x46/0x90
[ 3111.443773] ------------[ cut here ]------------
[ 3111.444248] WARNING: CPU: 0 PID: 19213 at block/blk.h:227 generic_make_request_checks+0x33f/0x460()
[ 3111.444742] Modules linked in: fuse can_bcm can_raw scsi_transport_iscsi ipt_ULOG nfnetlink nfc caif_socket caif af_802154 phonet af_rxrpc can pppoe pppox ppp_generic slhc irda crc_ccitt rds rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 xfs libcrc32c coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm crct10dif_pclmul crc32c_intel ghash_clmulni_intel snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic microcode serio_raw pcspkr btusb bluetooth 6lowpan_iphc rfkill usb_debug snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e ptp pps_core shpchp snd_timer snd soundcore
[ 3111.447600] CPU: 0 PID: 19213 Comm: trinity-c46 Not tainted 3.14.0-rc2+ #113
[ 3111.449427] 0000000000000009 000000004f998871 ffff88023715cc90 ffffffff8d72b091
[ 3111.450062] 0000000000000000 ffff88023715ccc8 ffffffff8d06d0cd ffff8801ef3a1440
[ 3111.450697] 00000000ffffffff 00000000197e5088 0000000000800005 ffff88024091b740
[ 3111.451339] Call Trace:
[ 3111.451968] [<ffffffff8d72b091>] dump_stack+0x4e/0x7a
[ 3111.452608] [<ffffffff8d06d0cd>] warn_slowpath_common+0x7d/0xa0
[ 3111.453247] [<ffffffff8d06d1fa>] warn_slowpath_null+0x1a/0x20
[ 3111.453882] [<ffffffff8d2e821f>] generic_make_request_checks+0x33f/0x460
[ 3111.454523] [<ffffffff8d2e8367>] generic_make_request+0x27/0x130
[ 3111.455159] [<ffffffff8d2e84e8>] submit_bio+0x78/0x160
[ 3111.455790] [<ffffffff8d1f9060>] ? bio_alloc_bioset+0x180/0x210
[ 3111.456432] [<ffffffffc031432b>] _xfs_buf_ioapply+0x2bb/0x3c0 [xfs]
[ 3111.457179] [<ffffffffc03858f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
[ 3111.457825] [<ffffffffc031449b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
[ 3111.458472] [<ffffffffc03858f2>] xlog_bdstrat+0x22/0x60 [xfs]
[ 3111.459120] [<ffffffffc0387a87>] xlog_sync+0x3a7/0x5b0 [xfs]
[ 3111.459768] [<ffffffffc0387d9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
[ 3111.460422] [<ffffffffc0388840>] xlog_write+0x6f0/0x800 [xfs]
[ 3111.461069] [<ffffffffc038a061>] xlog_cil_push+0x2f1/0x410 [xfs]
[ 3111.461714] [<ffffffffc038a948>] xlog_cil_force_lsn+0x1d8/0x210 [xfs]
[ 3111.462349] [<ffffffff8d1a6cd9>] ? __bit_spin_unlock.constprop.66+0x19/0x40
[ 3111.463001] [<ffffffffc0389413>] _xfs_log_force_lsn+0x93/0x340 [xfs]
[ 3111.463654] [<ffffffffc03896ee>] xfs_log_force_lsn+0x2e/0xb0 [xfs]
[ 3111.464311] [<ffffffffc0374209>] ? xfs_iunpin_wait+0x19/0x20 [xfs]
[ 3111.464966] [<ffffffffc0370b50>] __xfs_iunpin_wait+0xd0/0x1a0 [xfs]
[ 3111.465609] [<ffffffff8d0b9dc0>] ? autoremove_wake_function+0x40/0x40
[ 3111.466266] [<ffffffffc0374209>] xfs_iunpin_wait+0x19/0x20 [xfs]
[ 3111.466920] [<ffffffffc031e50c>] xfs_reclaim_inode+0x8c/0x380 [xfs]
[ 3111.467573] [<ffffffffc031ea7b>] xfs_reclaim_inodes_ag+0x27b/0x430 [xfs]
[ 3111.468226] [<ffffffffc031e900>] ? xfs_reclaim_inodes_ag+0x100/0x430 [xfs]
[ 3111.468881] [<ffffffffc031ed13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[ 3111.469535] [<ffffffffc032afa5>] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
[ 3111.470182] [<ffffffff8d1c107c>] super_cache_scan+0x16c/0x180
[ 3111.470831] [<ffffffff8d16225b>] shrink_slab_node+0x14b/0x2e0
[ 3111.471478] [<ffffffff8d1644af>] ? shrink_slab+0x3f/0x180
[ 3111.472121] [<ffffffff8d1644fe>] shrink_slab+0x8e/0x180
[ 3111.472761] [<ffffffff8d1677c6>] try_to_free_pages+0x516/0x970
[ 3111.473401] [<ffffffff8d1f2c57>] ? __set_page_dirty+0x27/0xc0
[ 3111.474040] [<ffffffff8d15af19>] __alloc_pages_nodemask+0x7a9/0xb00
[ 3111.474670] [<ffffffff8d19e806>] alloc_pages_current+0x106/0x1f0
[ 3111.475285] [<ffffffff8d040cb7>] ? pte_alloc_one+0x17/0x80
[ 3111.475879] [<ffffffff8d040cb7>] pte_alloc_one+0x17/0x80
[ 3111.476455] [<ffffffff8d17b4a7>] __pte_alloc+0x27/0x130
[ 3111.477012] [<ffffffff8d17effc>] handle_mm_fault+0xafc/0xbb0
[ 3111.477549] [<ffffffff8d17f27e>] __get_user_pages+0x1ce/0x620
[ 3111.478072] [<ffffffff8d17f984>] get_dump_page+0x54/0x80
[ 3111.478576] [<ffffffff8d219da9>] elf_core_dump+0x12d9/0x1420
[ 3111.479065] [<ffffffff8d219330>] ? elf_core_dump+0x860/0x1420
[ 3111.479546] [<ffffffff8d221ec2>] do_coredump+0xc02/0xf60
[ 3111.480021] [<ffffffff8d081358>] get_signal_to_deliver+0x2b8/0x6b0
[ 3111.480498] [<ffffffff8d0024d7>] do_signal+0x57/0x9d0
[ 3111.480964] [<ffffffff8d11117e>] ? __acct_update_integrals+0x8e/0x120
[ 3111.481435] [<ffffffff8d73a29b>] ? preempt_count_sub+0x6b/0xf0
[ 3111.481906] [<ffffffff8d735ae1>] ? _raw_spin_unlock+0x31/0x50
[ 3111.482377] [<ffffffff8d0ab671>] ? vtime_account_user+0x91/0xa0
[ 3111.482847] [<ffffffff8d14f59b>] ? context_tracking_user_exit+0x9b/0x100
[ 3111.483324] [<ffffffff8d002eac>] do_notify_resume+0x5c/0xa0
[ 3111.483800] [<ffffffff8d736806>] retint_signal+0x46/0x90
[ 3111.484274] ---[ end trace 8cc272739908f779 ]---

2014-02-12 04:22:22

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Tue, Feb 11, 2014 at 11:03:58PM -0500, Dave Jones wrote:
> [ 3111.414202] [<ffffffff8d1f9036>] bio_alloc_bioset+0x156/0x210
> [ 3111.414855] [<ffffffffc0314231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
> [ 3111.415517] [<ffffffffc03858f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
> [ 3111.416175] [<ffffffffc031449b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
> [ 3111.416843] [<ffffffffc03858f2>] xlog_bdstrat+0x22/0x60 [xfs]
> [ 3111.417509] [<ffffffffc0387a87>] xlog_sync+0x3a7/0x5b0 [xfs]
> [ 3111.418175] [<ffffffffc0387d9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
> [ 3111.418846] [<ffffffffc0388840>] xlog_write+0x6f0/0x800 [xfs]
> [ 3111.419518] [<ffffffffc038a061>] xlog_cil_push+0x2f1/0x410 [xfs]

Very interesting. The first thing xlog_cil_push() is doing is blocking
kmalloc(). So at that point it still hadn't been atomic. I'd probably
slap may_sleep() in the beginning of xlog_sync() and see if that triggers...

2014-02-12 05:40:50

by Dave Chinner

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Wed, Feb 12, 2014 at 04:22:15AM +0000, Al Viro wrote:
> On Tue, Feb 11, 2014 at 11:03:58PM -0500, Dave Jones wrote:
> > [ 3111.414202] [<ffffffff8d1f9036>] bio_alloc_bioset+0x156/0x210
> > [ 3111.414855] [<ffffffffc0314231>] _xfs_buf_ioapply+0x1c1/0x3c0 [xfs]
> > [ 3111.415517] [<ffffffffc03858f2>] ? xlog_bdstrat+0x22/0x60 [xfs]
> > [ 3111.416175] [<ffffffffc031449b>] xfs_buf_iorequest+0x6b/0xf0 [xfs]
> > [ 3111.416843] [<ffffffffc03858f2>] xlog_bdstrat+0x22/0x60 [xfs]
> > [ 3111.417509] [<ffffffffc0387a87>] xlog_sync+0x3a7/0x5b0 [xfs]
> > [ 3111.418175] [<ffffffffc0387d9f>] xlog_state_release_iclog+0x10f/0x120 [xfs]
> > [ 3111.418846] [<ffffffffc0388840>] xlog_write+0x6f0/0x800 [xfs]
> > [ 3111.419518] [<ffffffffc038a061>] xlog_cil_push+0x2f1/0x410 [xfs]
>
> Very interesting. The first thing xlog_cil_push() is doing is blocking
> kmalloc(). So at that point it still hadn't been atomic. I'd probably
> slap may_sleep() in the beginning of xlog_sync() and see if that triggers...

None of the XFS code disables interrupts in that path, not does is
call outside XFS except to dispatch IO. The stack is pretty deep at
this point and I know that the standard (non stacked) IO stack can
consume >3kb of stack space when it gets down to having to do memory
reclaim during GFP_NOIO allocation at the lowest level of SCSI
drivers. Stack overruns typically show up with symptoms like we are
seeing.

Simple example with memory allocation follows. keep in mind that
memory reclaim uses a whole lot more stack if it is needed, and that
scheduling at this point requires about 1k of stack to be free for
the scheduler footprint, too.

FWIW, the blk-mq stuff seems to hae added 200-300 bytes of new stack
usage to the IO path....

$ sudo cat /sys/kernel/debug/tracing/stack_trace
Depth Size Location (45 entries)
----- ---- --------
0) 5944 40 zone_statistics+0xbd/0xc0
1) 5904 256 get_page_from_freelist+0x3a8/0x8a0
2) 5648 256 __alloc_pages_nodemask+0x143/0x8e0
3) 5392 80 alloc_pages_current+0xb2/0x170
4) 5312 64 new_slab+0x265/0x2e0
5) 5248 240 __slab_alloc+0x2fb/0x4c4
6) 5008 80 __kmalloc+0x133/0x180
7) 4928 112 virtqueue_add_sgs+0x2fe/0x520
8) 4816 288 __virtblk_add_req+0xd5/0x180
9) 4528 96 virtio_queue_rq+0xdd/0x1d0
10) 4432 112 __blk_mq_run_hw_queue+0x1c3/0x3c0
11) 4320 16 blk_mq_run_hw_queue+0x35/0x40
12) 4304 80 blk_mq_insert_requests+0xc5/0x120
13) 4224 96 blk_mq_flush_plug_list+0x129/0x140
14) 4128 112 blk_flush_plug_list+0xe7/0x240
15) 4016 32 blk_finish_plug+0x18/0x50
16) 3984 192 _xfs_buf_ioapply+0x30f/0x3b0
17) 3792 48 xfs_buf_iorequest+0x6f/0xc0
....
37) 928 16 xfs_vn_create+0x13/0x20
38) 912 64 vfs_create+0xb5/0xf0
39) 848 208 do_last.isra.53+0x6e0/0xd00
40) 640 176 path_openat+0xbe/0x620
41) 464 208 do_filp_open+0x43/0xa0
42) 256 112 do_sys_open+0x13c/0x230
43) 144 16 SyS_open+0x22/0x30
44) 128 128 system_call_fastpath+0x16/0x1b


Dave, before chasing ghosts, can you (like Eric originally asked)
turn on stack overrun detection?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-02-12 05:50:41

by Dave Jones

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:

> None of the XFS code disables interrupts in that path, not does is
> call outside XFS except to dispatch IO. The stack is pretty deep at
> this point and I know that the standard (non stacked) IO stack can
> consume >3kb of stack space when it gets down to having to do memory
> reclaim during GFP_NOIO allocation at the lowest level of SCSI
> drivers. Stack overruns typically show up with symptoms like we are
> seeing.
> ..
>
> Dave, before chasing ghosts, can you (like Eric originally asked)
> turn on stack overrun detection?

CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.

Dave

2014-02-12 06:10:44

by Dave Chinner

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
> On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
>
> > None of the XFS code disables interrupts in that path, not does is
> > call outside XFS except to dispatch IO. The stack is pretty deep at
> > this point and I know that the standard (non stacked) IO stack can
> > consume >3kb of stack space when it gets down to having to do memory
> > reclaim during GFP_NOIO allocation at the lowest level of SCSI
> > drivers. Stack overruns typically show up with symptoms like we are
> > seeing.
> > ..
> >
> > Dave, before chasing ghosts, can you (like Eric originally asked)
> > turn on stack overrun detection?
>
> CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.

That only checks stack usage when an interrupt is taken. If no
interrupts are taken when stack usage is within 128 bytes of
overflow, then it doesn't catch it.

I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
stack usage of a process via canary overwrites and it records it in
do_exit(). I also use the stack tracer to record the largest stack
usage seen so I know exactly what code paths are approaching stack
overruns...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-02-12 06:28:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Tue, Feb 11, 2014 at 9:40 PM, Dave Chinner <[email protected]> wrote:
>
> None of the XFS code disables interrupts in that path, not does is
> call outside XFS except to dispatch IO. The stack is pretty deep at
> this point and I know that the standard (non stacked) IO stack can
> consume >3kb of stack space when it gets down to having to do memory
> reclaim during GFP_NOIO allocation at the lowest level of SCSI
> drivers. Stack overruns typically show up with symptoms like we are
> seeing.

That would also explain why it shows up for do_coredump(), even though
clearly interrupts are not disabled at that point. It's just because
do_coredump() opens a filename at a deeper point in the stack than the
more normal system call paths do.

It looks like just "do_signal()" has a stack frame that is about 230
bytes even under normal circumstancs (largely due to "struct ksignal"
- which in turn is largely due to the insane 128-byte padding in
siginfo_t). Add a few other frames in there, and I guess that if it
was close before, the coredump path just makes it go off.

And some of the debug options that I'm sure DaveJ has enabled tend to
make the stack usage worse (simply because they make a lot of data
structures bigger).

Linus

2014-02-12 06:31:58

by Dave Chinner

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Wed, Feb 12, 2014 at 05:10:38PM +1100, Dave Chinner wrote:
> On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
> > On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
> >
> > > None of the XFS code disables interrupts in that path, not does is
> > > call outside XFS except to dispatch IO. The stack is pretty deep at
> > > this point and I know that the standard (non stacked) IO stack can
> > > consume >3kb of stack space when it gets down to having to do memory
> > > reclaim during GFP_NOIO allocation at the lowest level of SCSI
> > > drivers. Stack overruns typically show up with symptoms like we are
> > > seeing.
> > > ..
> > >
> > > Dave, before chasing ghosts, can you (like Eric originally asked)
> > > turn on stack overrun detection?
> >
> > CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.
>
> That only checks stack usage when an interrupt is taken. If no
> interrupts are taken when stack usage is within 128 bytes of
> overflow, then it doesn't catch it.
>
> I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
> stack usage of a process via canary overwrites and it records it in
> do_exit(). I also use the stack tracer to record the largest stack
> usage seen so I know exactly what code paths are approaching stack
> overruns...

FYI, just creating lots of files with open(O_CREAT):

[ 348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left
[ 348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left
[ 349.777717] fs_mark (4826) used greatest stack depth: 2280 bytes left
[ 418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left
[ 460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left
[ 544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left
[ 689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left

We've got absolutely no spare stack space anymore in the IO path.
And the IO path can't get much simpler than filesystem -> virtio
block device.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-02-12 07:00:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Tue, Feb 11, 2014 at 10:31 PM, Dave Chinner <[email protected]> wrote:
>
> FYI, just creating lots of files with open(O_CREAT):
>
> [ 348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left
> [ 348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left
> [ 349.777717] fs_mark (4826) used greatest stack depth: 2280 bytes left
> [ 418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left
> [ 460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left
> [ 544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left
> [ 689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left
>
> We've got absolutely no spare stack space anymore in the IO path.
> And the IO path can't get much simpler than filesystem -> virtio
> block device.

Ugh, that's bad. A thousand bytes of stack space is much too close to
any limits.

Do you have the stack traces for these things so that we can look at
worst offenders?

If the new block-mq code is to blame, it needs to be fixed.
__virtblk_add_req() has a 300-byte stack frame, it seems. Looking
elsewhere, blkdev_issue_discard() has 350 bytes of stack frame, but is
hopefully not in any normal path - online discard is moronic, and I'm
assuming XFS doesn't do that.

There's a lot of 200+ byte stack frames in block/blk-core.s, and they
all seem to be of the type perf_trace_block_buffer() - things created
with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
frame, I have no idea. That sounds like a potential disaster too,
although hopefully it's mostly leaf functions - but leaf functions
*deep* in the callchain. Tejun? Steven, why _do_ they end up with such
huge frames?

And if the stack use comes from the VFS layer, we can probably work on
that too. But I don't think that has really changed much lately..

Linus

2014-02-12 07:18:36

by Dave Chinner

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote:
> On Tue, Feb 11, 2014 at 9:40 PM, Dave Chinner <[email protected]> wrote:
> >
> > None of the XFS code disables interrupts in that path, not does is
> > call outside XFS except to dispatch IO. The stack is pretty deep at
> > this point and I know that the standard (non stacked) IO stack can
> > consume >3kb of stack space when it gets down to having to do memory
> > reclaim during GFP_NOIO allocation at the lowest level of SCSI
> > drivers. Stack overruns typically show up with symptoms like we are
> > seeing.
>
> That would also explain why it shows up for do_coredump(), even though
> clearly interrupts are not disabled at that point. It's just because
> do_coredump() opens a filename at a deeper point in the stack than the
> more normal system call paths do.

Right, it's exactly the same problem we have when knfsd is making
the VFS calls.

> It looks like just "do_signal()" has a stack frame that is about 230
> bytes even under normal circumstancs (largely due to "struct ksignal"
> - which in turn is largely due to the insane 128-byte padding in
> siginfo_t). Add a few other frames in there, and I guess that if it
> was close before, the coredump path just makes it go off.

Yup. But it's when you see this sort of thing you realise that
almost no GFP_KERNEL memory allocation is really safe from stack
overruns, though:

0) 6064 64 update_group_power+0x2c/0x270
1) 6000 384 find_busiest_group+0x10a/0x8b0
2) 5616 288 load_balance+0x165/0x870
3) 5328 96 idle_balance+0x106/0x1b0
4) 5232 112 __schedule+0x7b6/0x7e0
5) 5120 16 schedule+0x29/0x70
6) 5104 176 percpu_ida_alloc+0x1b3/0x3d0
7) 4928 32 blk_mq_wait_for_tags+0x1f/0x40
8) 4896 80 blk_mq_alloc_request_pinned+0x4e/0x110
9) 4816 128 blk_mq_make_request+0x42b/0x510
10) 4688 48 generic_make_request+0xc0/0x110
11) 4640 96 submit_bio+0x71/0x120
12) 4544 192 _xfs_buf_ioapply+0x2cc/0x3b0
13) 4352 48 xfs_buf_iorequest+0x6f/0xc0
14) 4304 32 xlog_bdstrat+0x23/0x60
15) 4272 96 xlog_sync+0x3a4/0x5c0
16) 4176 48 xlog_state_release_iclog+0x121/0x130
17) 4128 192 xlog_write+0x700/0x7c0
18) 3936 192 xlog_cil_push+0x2ae/0x3d0
19) 3744 128 xlog_cil_force_lsn+0x1b8/0x1e0
20) 3616 144 _xfs_log_force_lsn+0x7c/0x300
21) 3472 48 xfs_log_force_lsn+0x3b/0xa0
22) 3424 112 xfs_iunpin_wait+0xd7/0x160
23) 3312 80 xfs_reclaim_inode+0xd0/0x350
24) 3232 432 xfs_reclaim_inodes_ag+0x277/0x3d0
25) 2800 48 xfs_reclaim_inodes_nr+0x33/0x40
26) 2752 16 xfs_fs_free_cached_objects+0x15/0x20
27) 2736 80 super_cache_scan+0x169/0x170
28) 2656 160 shrink_slab_node+0x133/0x290
29) 2496 80 shrink_slab+0x122/0x160
30) 2416 112 do_try_to_free_pages+0x1de/0x360
31) 2304 192 try_to_free_pages+0x110/0x190
32) 2112 256 __alloc_pages_nodemask+0x5a2/0x8e0
33) 1856 80 alloc_pages_current+0xb2/0x170
34) 1776 64 new_slab+0x265/0x2e0
35) 1712 240 __slab_alloc+0x2fb/0x4c4
36) 1472 80 kmem_cache_alloc+0x10b/0x140

That's almost 4700 bytes of stack usage from the
kmem_cache_alloc(GFP_KERNEL) call because it entered the IO path.
Yes, it's an extreme case, but if you're looking for a smoking
gun.... :/

I can fix this one easily - we already have a workqueue for doing
async log pushes (will split the stack between xlog_cil_force_lsn
and xlog_cil_push), but the reason we haven't used it for synchronous
log forces is that screws up fsync performance on CFQ. We don't
recommend CFQ with XFS anyway, so I think I'll make this change
anyway.

> And some of the debug options that I'm sure DaveJ has enabled tend to
> make the stack usage worse (simply because they make a lot of data
> structures bigger).

True - CONFIG_XFS_DEBUG tends to add about 5% to the stack usage of
XFS, but realistically 5% is not significant especially as we've
been hitting stack overflows with XFS on production systems
regularly enough on x86-64 over the past 2-3 years that we know what
"typical symptoms" of such overflows are...

The problem we have now is that everything outside XFS continues to
grow in stack usage, so the only choice that remains for us to avoid
overruns is to add performance impacting stack switches into the
middle of common XFS paths. We also have to do it unconditionally
because we don't know ahead of time if any specific operation is
going to require all the stack we can give it because - for example
- we can't predict when the IO path is going to require memory
allocation...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-02-12 08:13:41

by Tejun Heo

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Tue, Feb 11, 2014 at 10:59:58PM -0800, Linus Torvalds wrote:
> There's a lot of 200+ byte stack frames in block/blk-core.s, and they
> all seem to be of the type perf_trace_block_buffer() - things created
> with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
> frame, I have no idea. That sounds like a potential disaster too,
> although hopefully it's mostly leaf functions - but leaf functions
> *deep* in the callchain. Tejun? Steven, why _do_ they end up with such
> huge frames?

It looks like they're essentially the same for all the automatically
generated trace functions. I'm seeing 232 byte stack frame in most of
them. If I'm not completely confused by these macros, these are
generated by DECLARE_EVENT_CLASS() in include/trace/ftrace.h and
contains struct pt_regs in the stack frame which is already 168 bytes,
so that seems like the culprit. No idea whether this is something
avoidable. At least they shouldn't nest in any way. Steven?

Thanks.

--
tejun

2014-02-12 08:35:57

by Dave Chinner

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Tue, Feb 11, 2014 at 10:59:58PM -0800, Linus Torvalds wrote:
> On Tue, Feb 11, 2014 at 10:31 PM, Dave Chinner <[email protected]> wrote:
> >
> > FYI, just creating lots of files with open(O_CREAT):
> >
> > [ 348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left
> > [ 348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left
> > [ 349.777717] fs_mark (4826) used greatest stack depth: 2280 bytes left
> > [ 418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left
> > [ 460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left
> > [ 544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left
> > [ 689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left
> >
> > We've got absolutely no spare stack space anymore in the IO path.
> > And the IO path can't get much simpler than filesystem -> virtio
> > block device.
>
> Ugh, that's bad. A thousand bytes of stack space is much too close to
> any limits.
>
> Do you have the stack traces for these things so that we can look at
> worst offenders?

Sure. It's almost through the XFS block allocation path where we do
metadata reads or log writes. That's always been the worst case
stack usage path for XFS. The worst path is about 4800 bytes - it
was a double btree split (inode extent map tree split, followed by a
freespace btree split when allocating a block for the extent map
btree split), triggering a reallocation of a btree block that was
pinned and STALE, which triggered a log force, which then made it's
way into the block layer.

I don't have a stack trace of that because I can't create the
initial conditions required to trigger it on demand. I've only ever
seen it in the wild once. More common, though, is the situation we
see here - somewhere around 3-4k of stack usage from a single extent
btree update operation.

We've already got a stack switch in the data allocation path - we
had to add it to stop the bdi-flusher from busting the stack with
alarming regularity in production systems. I initially made the
metadata allocation paths use it as well, but that part got reverted
(aa29284 xfs: don't defer metadata allocation to the workqueue)
because Mel Gorman demonstrated several workloads that regressed
significantly as a result of making that change.

So now we are stuck betweeen a rock and a hard place - those
metadata block allocation paths are triggering the stack overflows,
but switching stacks in the allocation path to avoid stack overflows
causes massive performance regressions....

The thing is, the XFS stack usage has not changed significantly over
the past 5 years - we keep a pretty close eye on it. The issue is
that everything around XFS has slowly been growing and I can't
point you at one stack trace that demonstrates the worst case. What
I can point you to function calls that consume a lot of stack space
and hence limit what is available to callers.

So here's a few stack fragements I've generated in teh past few
minutes on a couple of XFS filesystems inside a 16p/16GB VM by
running:

$ dd if=/dev/zero of=/mnt/test/foo bs=1024k &

which writes at about 600MB/s to a 18TB RAID array. That runs in
parallel with a multithreaded inode creation/walk/remove workload on
a different filesystem (creates 50 million inodes ~250k inode
creates/s, walks them removes them at around 170k removes/s) which
is hosted on a dm RAID0 stripe of 2 samsung 840 EVO SSDs.

mutex_lock() requires at least 1.2k of stack because of the
scheduler overhead.:

$ sudo cat /sys/kernel/debug/tracing/stack_trace
Depth Size Location (39 entries)
----- ---- --------
0) 4368 64 update_group_power+0x2c/0x270
1) 4304 384 find_busiest_group+0x10a/0x8b0
2) 3920 288 load_balance+0x165/0x870
3) 3632 96 idle_balance+0x106/0x1b0
4) 3536 112 __schedule+0x7b6/0x7e0
5) 3424 16 schedule+0x29/0x70
6) 3408 16 schedule_preempt_disabled+0xe/0x10
7) 3392 112 __mutex_lock_slowpath+0x11a/0x400
8) 3280 32 mutex_lock+0x1e/0x32

i.e. any function that is going to schedule needs at least 1k of
stack, and some of the lead in infrastructure (like
wait_for_completion) can require a total of up to 1.5k...

Basic memory allocation can easily >1k of stack without
entering reclaim:

----- ---- --------
0) 4920 40 zone_statistics+0xbd/0xc0
1) 4880 256 get_page_from_freelist+0x3a8/0x8a0
2) 4624 256 __alloc_pages_nodemask+0x143/0x8e0
3) 4368 80 alloc_pages_current+0xb2/0x170
4) 4288 64 new_slab+0x265/0x2e0
5) 4224 240 __slab_alloc+0x2fb/0x4c4
6) 3984 80 kmem_cache_alloc+0x10b/0x140
7) 3904 48 kmem_zone_alloc+0x77/0x100

If it enters reclaim and we are allowed to IO? Well, you saw the
stack I posted in the other thread - 4700 bytes from
kmem_cache_alloc() to the top of the stack. Another bad case is
the swap path (using a scsi device rather than virtio for the
root/swap devices):

Depth Size Location (44 entries)
----- ---- --------
0) 4496 16 mempool_alloc_slab+0x15/0x20
1) 4480 128 mempool_alloc+0x66/0x170
2) 4352 16 scsi_sg_alloc+0x4e/0x60
3) 4336 112 __sg_alloc_table+0x68/0x130
4) 4224 32 scsi_init_sgtable+0x34/0x90
5) 4192 48 scsi_init_io+0x34/0xd0
6) 4144 32 scsi_setup_fs_cmnd+0x66/0xa0
7) 4112 112 sd_prep_fn+0x2a0/0xce0
8) 4000 48 blk_peek_request+0x13c/0x260
9) 3952 112 scsi_request_fn+0x4b/0x490
10) 3840 32 __blk_run_queue+0x37/0x50
11) 3808 64 queue_unplugged+0x39/0xb0
12) 3744 112 blk_flush_plug_list+0x20b/0x240
13) 3632 80 blk_queue_bio+0x1ca/0x310
14) 3552 48 generic_make_request+0xc0/0x110
15) 3504 96 submit_bio+0x71/0x120
16) 3408 160 __swap_writepage+0x184/0x220
17) 3248 32 swap_writepage+0x42/0x90
18) 3216 304 shrink_page_list+0x6fd/0xa20
19) 2912 208 shrink_inactive_list+0x243/0x480
20) 2704 288 shrink_lruvec+0x371/0x670
21) 2416 112 do_try_to_free_pages+0x11a/0x360
22) 2304 192 try_to_free_pages+0x110/0x190
23) 2112 256 __alloc_pages_nodemask+0x5a2/0x8e0
24) 1856 80 alloc_pages_current+0xb2/0x170
25) 1776 64 new_slab+0x265/0x2e0
26) 1712 240 __slab_alloc+0x2fb/0x4c4
27) 1472 80 kmem_cache_alloc+0x10b/0x140
28) 1392 48 kmem_zone_alloc+0x77/0x100

There's over 3k in the stack from kmem_zone_alloc(GFP_KERNEL) and
that's got nothing XFS in it at all - it's just memory allocation,
reclaim and Io path. And it also demonstrates that the scsi layer
has significant stack usage.

here's 2.5k from submit_bio() on virtio block device:

Depth Size Location (44 entries)
----- ---- --------
0) 4512 64 update_curr+0x8b/0x160
1) 4448 96 enqueue_task_fair+0x39d/0xd50
2) 4352 48 enqueue_task+0x50/0x60
3) 4304 16 activate_task+0x23/0x30
4) 4288 32 ttwu_do_activate.constprop.84+0x3c/0x70
5) 4256 96 try_to_wake_up+0x1b4/0x2c0
6) 4160 16 default_wake_function+0x12/0x20
7) 4144 32 autoremove_wake_function+0x16/0x40
8) 4112 80 __wake_up_common+0x55/0x90
9) 4032 64 __wake_up+0x48/0x70
10) 3968 80 wakeup_kswapd+0xe7/0x130
11) 3888 256 __alloc_pages_nodemask+0x371/0x8e0
12) 3632 80 alloc_pages_current+0xb2/0x170
13) 3552 64 new_slab+0x265/0x2e0
14) 3488 240 __slab_alloc+0x2fb/0x4c4
15) 3248 80 __kmalloc+0x133/0x180
16) 3168 112 virtqueue_add_sgs+0x2fe/0x520
17) 3056 288 __virtblk_add_req+0xd5/0x180
18) 2768 96 virtio_queue_rq+0xdd/0x1d0
19) 2672 112 __blk_mq_run_hw_queue+0x1c3/0x3c0
20) 2560 16 blk_mq_run_hw_queue+0x35/0x40
21) 2544 80 blk_mq_insert_requests+0xc5/0x120
22) 2464 96 blk_mq_flush_plug_list+0x129/0x140
23) 2368 112 blk_flush_plug_list+0xe7/0x240
24) 2256 128 blk_mq_make_request+0x3ca/0x510
25) 2128 48 generic_make_request+0xc0/0x110
26) 2080 96 submit_bio+0x71/0x120

You can also see the difference in stack usage that the blk_mq layer
adds compared to the previous trace that went through the old
code to get to the SCSI stack.

And you saw the memory reclaim via shrinker path taht I posted in
another email - that was 4700 bytes from kmem_cache_alloc() to teh
top of the stack (can be reduced to about 2000 bytes by having XFS
chop it in half).

> If the new block-mq code is to blame, it needs to be fixed.

It's a symptom of the pattern of gradual growth in just about every
core subsystem. Each individual change is not significant, but when
you put the whole stack together the canary is well and truly dead.

> __virtblk_add_req() has a 300-byte stack frame, it seems. Looking
> elsewhere, blkdev_issue_discard() has 350 bytes of stack frame, but is
> hopefully not in any normal path - online discard is moronic, and I'm
> assuming XFS doesn't do that.

We *support* it, but we don't *recommend* it. So, no, that's not the
problem path. ;)

> There's a lot of 200+ byte stack frames in block/blk-core.s, and they
> all seem to be of the type perf_trace_block_buffer() - things created
> with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
> frame, I have no idea. That sounds like a potential disaster too,
> although hopefully it's mostly leaf functions - but leaf functions
> *deep* in the callchain. Tejun? Steven, why _do_ they end up with such
> huge frames?

And it's leaf functions that the CONFIG_STACK_TRACER doesn't catch
on x86-64 (at least, according to the documentation).
CONFIG_DEBUG_STACK_USAGE output is showing up to 800 bytes more
stack usage than the tracer. As such, I also think that
CONFIG_DEBUG_STACK_USAGE output is a more reliable iindication of
stack usage because it is canary based and so captures the very
worst case usage of the process's stack...

It really seems to me that we have got to the point where it is not
safe to do anything like submit IO or try to do memory allocation
with any more than half the stack space consumed....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-02-12 11:39:35

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote:

> It looks like just "do_signal()" has a stack frame that is about 230
> bytes even under normal circumstancs (largely due to "struct ksignal"
> - which in turn is largely due to the insane 128-byte padding in
> siginfo_t). Add a few other frames in there, and I guess that if it
> was close before, the coredump path just makes it go off.

We could, in principle, put it into task_struct and make get_signal()
return its address - do_signal() is called only in the code that does
assorted returns to userland...

2014-02-12 12:40:41

by Steven Rostedt

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

[ Added the perf tracepoint maintainers ]

On Tue, 11 Feb 2014 22:59:58 -0800
Linus Torvalds <[email protected]> wrote:

> On Tue, Feb 11, 2014 at 10:31 PM, Dave Chinner <[email protected]> wrote:
> >
> > FYI, just creating lots of files with open(O_CREAT):
> >
> > [ 348.718357] fs_mark (4828) used greatest stack depth: 2968 bytes left
> > [ 348.769846] fs_mark (4814) used greatest stack depth: 2312 bytes left
> > [ 349.777717] fs_mark (4826) used greatest stack depth: 2280 bytes left
> > [ 418.139415] fs_mark (4928) used greatest stack depth: 1936 bytes left
> > [ 460.492282] fs_mark (4993) used greatest stack depth: 1336 bytes left
> > [ 544.825418] fs_mark (5104) used greatest stack depth: 1112 bytes left
> > [ 689.503970] fs_mark (5265) used greatest stack depth: 1000 bytes left
> >
> > We've got absolutely no spare stack space anymore in the IO path.
> > And the IO path can't get much simpler than filesystem -> virtio
> > block device.
>
> Ugh, that's bad. A thousand bytes of stack space is much too close to
> any limits.
>
> Do you have the stack traces for these things so that we can look at
> worst offenders?
>
> If the new block-mq code is to blame, it needs to be fixed.
> __virtblk_add_req() has a 300-byte stack frame, it seems. Looking
> elsewhere, blkdev_issue_discard() has 350 bytes of stack frame, but is
> hopefully not in any normal path - online discard is moronic, and I'm
> assuming XFS doesn't do that.
>
> There's a lot of 200+ byte stack frames in block/blk-core.s, and they
> all seem to be of the type perf_trace_block_buffer() - things created
> with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
> frame, I have no idea. That sounds like a potential disaster too,
> although hopefully it's mostly leaf functions - but leaf functions
> *deep* in the callchain. Tejun? Steven, why _do_ they end up with such
> huge frames?

The perf_trace_##event is defined in include/trace/ftrace.h.

There we have this:

perf_trace_##call(void *__data, proto) \
{ \
struct ftrace_event_call *event_call = __data; \
struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
struct ftrace_raw_##call *entry; \
struct pt_regs __regs; \
u64 __addr = 0, __count = 1; \
struct task_struct *__task = NULL; \
struct hlist_head *head; \
int __entry_size; \
int __data_size; \
int rctx; \
\

Mostly pointers except for two structures. The __data_offests, is
dynamically defined, and only consists of values from the tracepoint
entry_structure that defines dynamic arrays. But the other structure on
the stack looks a bit harrier. The pt_regs structure.

That's what? 21 unsigned longs? 21 * 8 = 168. I think that's the
culprit here.

Peter and Frederic, is there a way not to store that on the stack?

-- Steve

>
> And if the stack use comes from the VFS layer, we can probably work on
> that too. But I don't think that has really changed much lately..
>
> Linus

2014-02-12 12:44:26

by Steven Rostedt

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Wed, 12 Feb 2014 03:13:33 -0500
Tejun Heo <[email protected]> wrote:

> On Tue, Feb 11, 2014 at 10:59:58PM -0800, Linus Torvalds wrote:
> > There's a lot of 200+ byte stack frames in block/blk-core.s, and they
> > all seem to be of the type perf_trace_block_buffer() - things created
> > with DECLARE_EVENT_CLASS(), afaik. Why they all have 200+ bytes of
> > frame, I have no idea. That sounds like a potential disaster too,
> > although hopefully it's mostly leaf functions - but leaf functions
> > *deep* in the callchain. Tejun? Steven, why _do_ they end up with such
> > huge frames?
>
> It looks like they're essentially the same for all the automatically
> generated trace functions. I'm seeing 232 byte stack frame in most of
> them. If I'm not completely confused by these macros, these are
> generated by DECLARE_EVENT_CLASS() in include/trace/ftrace.h and
> contains struct pt_regs in the stack frame which is already 168 bytes,
> so that seems like the culprit. No idea whether this is something
> avoidable. At least they shouldn't nest in any way. Steven?

They shouldn't nest. But if the perf tracepoint is active, I don't know
how much more of the stack is used in the functions that tracepoint
calls.

-- Steve

2014-02-12 12:51:25

by Steven Rostedt

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Wed, 12 Feb 2014 19:35:13 +1100
Dave Chinner <[email protected]> wrote:


> And it's leaf functions that the CONFIG_STACK_TRACER doesn't catch
> on x86-64 (at least, according to the documentation).
> CONFIG_DEBUG_STACK_USAGE output is showing up to 800 bytes more
> stack usage than the tracer. As such, I also think that
> CONFIG_DEBUG_STACK_USAGE output is a more reliable iindication of
> stack usage because it is canary based and so captures the very
> worst case usage of the process's stack...

Yeah, with the new fentry (adding the mcount call before setting up the
stack frame), the function tracer can not catch leaf functions, as it
is called before the leaf function's frame is set up.

Hmm, I wonder if I should add a config to disable fentry and go back to
the old mcount that gets called after setting up the stack frame. This
will lead to better stack tracing, but you lose out on all the benefits
of fentry.

-- Steve

2014-02-12 13:30:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Wed, Feb 12, 2014 at 07:40:36AM -0500, Steven Rostedt wrote:
> The pt_regs structure.
>
> That's what? 21 unsigned longs? 21 * 8 = 168. I think that's the
> culprit here.
>
> Peter and Frederic, is there a way not to store that on the stack?

Something like so?

---
include/trace/ftrace.h | 7 ++++---
kernel/trace/trace_event_perf.c | 5 ++++-
2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 1a8b28db3775..87ae3ef1d278 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -678,7 +678,7 @@ perf_trace_##call(void *__data, proto) \
struct ftrace_event_call *event_call = __data; \
struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
struct ftrace_raw_##call *entry; \
- struct pt_regs __regs; \
+ struct pt_regs *__regs; \
u64 __addr = 0, __count = 1; \
struct task_struct *__task = NULL; \
struct hlist_head *head; \
@@ -697,18 +697,19 @@ perf_trace_##call(void *__data, proto) \
sizeof(u64)); \
__entry_size -= sizeof(u32); \
\
- perf_fetch_caller_regs(&__regs); \
entry = perf_trace_buf_prepare(__entry_size, \
event_call->event.type, &__regs, &rctx); \
if (!entry) \
return; \
\
+ perf_fetch_caller_regs(__regs); \
+ \
tstruct \
\
{ assign; } \
\
perf_trace_buf_submit(entry, __entry_size, rctx, __addr, \
- __count, &__regs, head, __task); \
+ __count, __regs, head, __task); \
}

/*
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index e854f420e033..1885f4aac109 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -232,8 +232,10 @@ void perf_trace_del(struct perf_event *p_event, int flags)
tp_event->class->reg(tp_event, TRACE_REG_PERF_DEL, p_event);
}

+static DEFINE_PER_CPU(struct pt_regs, tp_regs[4]);
+
__kprobes void *perf_trace_buf_prepare(int size, unsigned short type,
- struct pt_regs *regs, int *rctxp)
+ struct pt_regs **regs, int *rctxp)
{
struct trace_entry *entry;
unsigned long flags;
@@ -252,6 +254,7 @@ __kprobes void *perf_trace_buf_prepare(int size, unsigned short type,
if (*rctxp < 0)
return NULL;

+ *regs = this_cpu_ptr(&tp_regs[*rctxp]);
raw_data = this_cpu_ptr(perf_trace_buf[*rctxp]);

/* zero the dead bytes from align to not leak stack to user */

2014-02-12 14:25:55

by Dave Jones

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Wed, Feb 12, 2014 at 05:10:38PM +1100, Dave Chinner wrote:
> On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
> > On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
> >
> > > None of the XFS code disables interrupts in that path, not does is
> > > call outside XFS except to dispatch IO. The stack is pretty deep at
> > > this point and I know that the standard (non stacked) IO stack can
> > > consume >3kb of stack space when it gets down to having to do memory
> > > reclaim during GFP_NOIO allocation at the lowest level of SCSI
> > > drivers. Stack overruns typically show up with symptoms like we are
> > > seeing.
> > > ..
> > >
> > > Dave, before chasing ghosts, can you (like Eric originally asked)
> > > turn on stack overrun detection?
> >
> > CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.
>
> That only checks stack usage when an interrupt is taken. If no
> interrupts are taken when stack usage is within 128 bytes of
> overflow, then it doesn't catch it.
>
> I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
> stack usage of a process via canary overwrites and it records it in
> do_exit().

I had that on too. The only message from it came from quite a while
before the trace that happened overnight..

[ 3415.655125] trinity-c0 (4383) used greatest stack depth: 992 bytes left
[12900.804230] BUG: sleeping function called from invalid context at mm/mempool.c:203

> I also use the stack tracer to record the largest stack
> usage seen so I know exactly what code paths are approaching stack
> overruns...

I can give that a try later.

Dave

2014-02-12 15:57:58

by Eric Sandeen

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On 2/12/14, 12:10 AM, Dave Chinner wrote:
> On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
>> On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
>>
>> > None of the XFS code disables interrupts in that path, not does is
>> > call outside XFS except to dispatch IO. The stack is pretty deep at
>> > this point and I know that the standard (non stacked) IO stack can
>> > consume >3kb of stack space when it gets down to having to do memory
>> > reclaim during GFP_NOIO allocation at the lowest level of SCSI
>> > drivers. Stack overruns typically show up with symptoms like we are
>> > seeing.
>> > ..
>> >
>> > Dave, before chasing ghosts, can you (like Eric originally asked)
>> > turn on stack overrun detection?
>>
>> CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.
>
> That only checks stack usage when an interrupt is taken. If no
> interrupts are taken when stack usage is within 128 bytes of
> overflow, then it doesn't catch it.
>
> I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
> stack usage of a process via canary overwrites and it records it in
> do_exit(). I also use the stack tracer to record the largest stack
> usage seen so I know exactly what code paths are approaching stack
> overruns...
>
> Cheers,
>
> Dave.
>


I'm not sure if I'm off base here, but maybe this would make sense: check
for a corrupted stack in __might_sleep. Compile tested only,
possibly inelegant, and/or completely wrong, but:


From: Eric Sandeen <[email protected]>

sched: Test for corrupted task_struct in __might_sleep

If a thread overruns the stack, it may corrupt the task_struct,
leading to false positives on tests like irqs_disabled().

Warn if this seems to be the case.

Signed-off-by: Eric Sandeen <[email protected]>
---

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b46131e..6920c3c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6934,6 +6934,8 @@ static inline int preempt_count_equals(int preempt_offset)

void __might_sleep(const char *file, int line, int preempt_offset)
{
+ struct task_struct *tsk = current;
+ unsigned long *stackend;
static unsigned long prev_jiffy; /* ratelimiting */

rcu_sleep_check(); /* WARN_ON_ONCE() by default, no rate limit reqd. */
@@ -6952,6 +6954,11 @@ void __might_sleep(const char *file, int line, int preempt_offset)
in_atomic(), irqs_disabled(),
current->pid, current->comm);

+ /* A corrupted stack can cause a false positive on irqs_disabled etc */
+ stackend = end_of_stack(tsk);
+ if (tsk != &init_task && *stackend != STACK_END_MAGIC)
+ printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
+
debug_show_held_locks(current);
if (irqs_disabled())
print_irqtrace_events(current);

2014-02-12 20:13:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Wed, Feb 12, 2014 at 3:39 AM, Al Viro <[email protected]> wrote:
> On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote:
>
>> It looks like just "do_signal()" has a stack frame that is about 230
>> bytes even under normal circumstancs (largely due to "struct ksignal"
>> - which in turn is largely due to the insane 128-byte padding in
>> siginfo_t). Add a few other frames in there, and I guess that if it
>> was close before, the coredump path just makes it go off.
>
> We could, in principle, put it into task_struct and make get_signal()
> return its address - do_signal() is called only in the code that does
> assorted returns to userland...

We have better uses for random buffers in "struct task_struct", I'd
hate to put a siginfo_t there.

The thing is, siginfo_t has that idiotic 128-byte area, but it's all
"for future expansion". I think it's some damn glibc disease - we've
seen these kinds of insane paddings before.

The actual *useful* part of siginfo_t is on the order of 32 bytes. If that.

Sad.

Linus

2014-02-12 21:14:31

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Wed, Feb 12, 2014 at 12:13:19PM -0800, Linus Torvalds wrote:
> On Wed, Feb 12, 2014 at 3:39 AM, Al Viro <[email protected]> wrote:
> > On Tue, Feb 11, 2014 at 10:28:12PM -0800, Linus Torvalds wrote:
> >
> >> It looks like just "do_signal()" has a stack frame that is about 230
> >> bytes even under normal circumstancs (largely due to "struct ksignal"
> >> - which in turn is largely due to the insane 128-byte padding in
> >> siginfo_t). Add a few other frames in there, and I guess that if it
> >> was close before, the coredump path just makes it go off.
> >
> > We could, in principle, put it into task_struct and make get_signal()
> > return its address - do_signal() is called only in the code that does
> > assorted returns to userland...
>
> We have better uses for random buffers in "struct task_struct", I'd
> hate to put a siginfo_t there.

*nod*

> The thing is, siginfo_t has that idiotic 128-byte area, but it's all
> "for future expansion". I think it's some damn glibc disease - we've
> seen these kinds of insane paddings before.
>
> The actual *useful* part of siginfo_t is on the order of 32 bytes. If that.
>
> Sad.

Umm... What if we delay __sigqueue_free()? After all, that's where the
fat sucker normally comes from. That way we might get away with much
smaller structure on stack...

Just introduce a small structure that would contain signr, uid, pid and
pointer to struct sigqueue. And pass a pointer to _that_ all the way down
to collect_signal(). Pointer's NULL == it's SI_USER with signr/uid/pid
from the small struct and all other fields are zero. Pointer isn't NULL -
use &small_struct->p->info. And have struct sigqueue actually freed
via task_work_add() in that case.

Do you see any fundamental problems with that? Looks like it would be
faster as well - less copying involved...

2014-02-12 21:14:44

by Dave Chinner

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Wed, Feb 12, 2014 at 09:25:38AM -0500, Dave Jones wrote:
> On Wed, Feb 12, 2014 at 05:10:38PM +1100, Dave Chinner wrote:
> > On Wed, Feb 12, 2014 at 12:50:27AM -0500, Dave Jones wrote:
> > > On Wed, Feb 12, 2014 at 04:40:43PM +1100, Dave Chinner wrote:
> > >
> > > > None of the XFS code disables interrupts in that path, not does is
> > > > call outside XFS except to dispatch IO. The stack is pretty deep at
> > > > this point and I know that the standard (non stacked) IO stack can
> > > > consume >3kb of stack space when it gets down to having to do memory
> > > > reclaim during GFP_NOIO allocation at the lowest level of SCSI
> > > > drivers. Stack overruns typically show up with symptoms like we are
> > > > seeing.
> > > > ..
> > > >
> > > > Dave, before chasing ghosts, can you (like Eric originally asked)
> > > > turn on stack overrun detection?
> > >
> > > CONFIG_DEBUG_STACKOVERFLOW ? Already turned on.
> >
> > That only checks stack usage when an interrupt is taken. If no
> > interrupts are taken when stack usage is within 128 bytes of
> > overflow, then it doesn't catch it.
> >
> > I tend to use CONFIG_DEBUG_STACK_USAGE=y as it records the maximum
> > stack usage of a process via canary overwrites and it records it in
> > do_exit().
>
> I had that on too. The only message from it came from quite a while
> before the trace that happened overnight..

Right, it won't capture an overrun at the point in time an overrun
occurs, either, because it only checks when the process exits. But
it does tell you what stack usage is being seen, as this:

> [ 3415.655125] trinity-c0 (4383) used greatest stack depth: 992 bytes left
> [12900.804230] BUG: sleeping function called from invalid context at mm/mempool.c:203

is a pretty a good indication that trinity is at risk of stack
overuns...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-02-12 21:32:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Wed, Feb 12, 2014 at 1:14 PM, Al Viro <[email protected]> wrote:
>
> Umm... What if we delay __sigqueue_free()? After all, that's where the
> fat sucker normally comes from. That way we might get away with much
> smaller structure on stack...

Sounds like the RightThing(tm) to do to me, and I don't see why it
wouldn't work.

We'd have to teach each user of "dequeue_signal()" to free the siginfo
thing. Which shouldn't be too bad - I think we've collected all of
that into generic code, and there isn't the mass or architecture code
that knows about these things any more. But there are a few odd
drivers etc and signalfd. I didn't look at what the lifetimes were.

Adding Oleg to the cc, since any time we touch any of that code, he
should be involved. Oleg - the issue is the biggish size of 'struct
ksignal' on stack, brought on by the silly "put a whole siginfo_t in
it".

Linus

2014-02-12 21:44:15

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Wed, Feb 12, 2014 at 01:32:55PM -0800, Linus Torvalds wrote:
> On Wed, Feb 12, 2014 at 1:14 PM, Al Viro <[email protected]> wrote:
> >
> > Umm... What if we delay __sigqueue_free()? After all, that's where the
> > fat sucker normally comes from. That way we might get away with much
> > smaller structure on stack...
>
> Sounds like the RightThing(tm) to do to me, and I don't see why it
> wouldn't work.
>
> We'd have to teach each user of "dequeue_signal()" to free the siginfo
> thing. Which shouldn't be too bad - I think we've collected all of
> that into generic code, and there isn't the mass or architecture code
> that knows about these things any more. But there are a few odd
> drivers etc and signalfd. I didn't look at what the lifetimes were.

Only signalfd, AFAICS. And there we'd want to use the same small structure -
it's used in
do {
ret = signalfd_dequeue(ctx, &info, nonblock);
if (unlikely(ret <= 0))
break;
ret = signalfd_copyinfo(siginfo, &info);
if (ret < 0)
break;
siginfo++;
total += ret;
nonblock = 1;
} while (--count);
and using a smaller struct would actually speed the things up - skips one
copying. sigqueue would be freed as soon as we'd done signalfd_copyinfo()
(if not by signalfd_copyinfo() itself).

I'll try to put something along those lines together, if you or Oleg don't
do it first.

2014-02-13 17:40:23

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On 02/12, Linus Torvalds wrote:
>
> On Wed, Feb 12, 2014 at 1:14 PM, Al Viro <[email protected]> wrote:
> >
> > Umm... What if we delay __sigqueue_free()? After all, that's where the
> > fat sucker normally comes from. That way we might get away with much
> > smaller structure on stack...
>
> Sounds like the RightThing(tm) to do to me, and I don't see why it
> wouldn't work.

Probably... I'll try to reply tomorrow.

> We'd have to teach each user of "dequeue_signal()" to free the siginfo
> thing.

And we should be careful with SIGQUEUE_PREALLOC, at least
collect_signal() should not do list_del_init()... Plus we need to
handle the SEND_SIG_FORCED-like case.

Oleg.

2014-02-13 17:58:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov <[email protected]> wrote:
>
> And we should be careful with SIGQUEUE_PREALLOC, at least
> collect_signal() should not do list_del_init()... Plus we need to
> handle the SEND_SIG_FORCED-like case.

I don't think the users need to care. They'd just call
"sigqueue_free()" not knowing about our preallocations etc. That kind
of detail should be confined to inside signal.c.

But there really aren't that many users. There's a couple of
"dequeue_signal_lock()" users, but they don't actually *want* the
siginfo at all (they're kernel threads), so we can just make that
function free the siginfo immediately (and get rid of the totally
unnecessay kernel stack allocation). And outside of signal.c only
signalfd uses "dequeue_signal()" itself, and that would be the only
one that would need to be taught to use (in signalfd_copyinfo()) and
then free the sigqueue entry.

So it really looks like the right thing to do, and fairly
straightforward. But I'm leaving the coding proof to Al, since he
already offered ;)

Linus

2014-02-13 18:10:22

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On 02/13, Linus Torvalds wrote:
>
> On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov <[email protected]> wrote:
> >
> > And we should be careful with SIGQUEUE_PREALLOC, at least
> > collect_signal() should not do list_del_init()... Plus we need to
> > handle the SEND_SIG_FORCED-like case.
>
> I don't think the users need to care. They'd just call
> "sigqueue_free()" not knowing about our preallocations etc.

Yes, but we need to be careful to avoid the races with
release_posix_timer().

> That kind
> of detail should be confined to inside signal.c.

Yes, sure.

> But there really aren't that many users. There's a couple of
> "dequeue_signal_lock()" users, but they don't actually *want* the
> siginfo at all (they're kernel threads), so we can just make that
> function free the siginfo immediately (and get rid of the totally
> unnecessay kernel stack allocation).

Yes. Perhaps this helper should be changed/renamed. And perhaps we
can even change __send_signal() to avoid __sigqueue_alloc() if
PF_KTHREAD. Or sig == SIGKILL.

Oleg.

2014-02-13 18:37:39

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On 02/13, Oleg Nesterov wrote:
>
> On 02/13, Linus Torvalds wrote:
> >
> > On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov <[email protected]> wrote:
> > >
> > > And we should be careful with SIGQUEUE_PREALLOC, at least
> > > collect_signal() should not do list_del_init()... Plus we need to
> > > handle the SEND_SIG_FORCED-like case.
> >
> > I don't think the users need to care. They'd just call
> > "sigqueue_free()" not knowing about our preallocations etc.
>
> Yes, but we need to be careful to avoid the races with
> release_posix_timer().

Plus we need to delay do_schedule_next_timer() as well. But this
is probably good because we can avoid unlock/lock(siglock) in
dequeue_signal().

Oleg.

2014-02-13 20:51:56

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Wed, Feb 12, 2014 at 09:44:11PM +0000, Al Viro wrote:

> I'll try to put something along those lines together, if you or Oleg don't
> do it first.

OK, having looked at that stuff...

1) things become much more compact if we finish conversion to get_signal()
first. Callers of get_signal_to_deliver() have k_sigaction and siginfo in
pair of local variables; switching to ksignal will be neutral wrt stack
footprint (it just gathers those two in one struct) *and* we are getting
rid of passing struct siginfo * around. With that done, we can change
struct ksignal ->info with zero impact on the code in arch/*, and conversion
makes sense on its own. In the mainline we have it done for alpha, arm,
openrisc, sparc and x86. I've just put together preliminary (and completely
untested) patches for arm64, m68k and um; doing the rest won't take long, but
they'll obviously need to be tested. It's a fairly safe conversion;
I'd expect the worst bugs to be typos.

2) struct small_siginfo ought to go into linux/signal.h. Contains signal
number, si_code (usually 0), uid, pid and (often NULL) pointer to
struct sigqueue. Overall: 20 or 24 bytes. In struct coredump_params we
should replace ->siginfo with pointer to struct small_siginfo. Ditto for
task_struct ->last_siginfo. In struct ksignal ->info becomes
struct small_siginfo.

copy_siginfo_to_user{,32}() gets switched to struct small_siginfo *, so does
do_coredump(), ptrace_signal(), ptrace_stop(), ptrace_getsiginfo(),
ptrace_setsiginfo(), collect_signal(), __dequeue_signal(), dequeue_signal(),
signalfd_dequeue() and do_sigtimedwait().

New primitive: assign_sigqueue(small_info, sigqueue). Frees
small_info->q and assigns a new value to it.

When ptrace_setsiginfo() is given a non-plain siginfo (si_code != SI_USER,
that is), it should a allocate struct sigqueue, copy the sucker there and
give it to assign_sigqueue(). Plain ones just get uid/pid/signo copied
and NULL passed to assign_sigqueue().

code in ptrace_signal() under if (signr != info->si_signo) should
start with assign_sigqueue(info, NULL).

specific_send_sig_info() in ptrace_signal() is rather clumsy to handle; not
sure what's the best way to deal with that. In any case, ptrace_signal()
deciding to return 0 should take care of info->q - either copy and free the
original, or actually try to reuse the sucker. Either way, info->q should
become NULL.

get_signal_to_deliver() should do assign_sigqueue(info, NULL) before
the call of do_group_exit().

arch_ptrace_stop_needed() can be left as-is; it's a macro and none of the
instances look at the info argument at all. The same goes for
arch_ptrace_stop().

signalfd_read() should do assign_sigqueue(&info, NULL) right after
signalfd_copyinfo().

rt_sigtimedwait(2) and its compat variant should do the same right after
copy_siginfo_to_user{,32}().

TP'ed stuff is a mess, as usual. AFAICS, TP_STORE_SIGINFO() needs to
be split into a variant taking siginfo (on trace_signal_generate side)
and one taking small_siginfo (for trace_signal_deliver).

get_signal() should do assign_sigqueue(&ksig->info, NULL) if it returns
false and use task_work_add() to schedule info->q for freeing otherwise.


Overall, aside of the need to complete arch/* conversion, the only unfinished
part is this:
/* If the (new) signal is now blocked, requeue it. */
if (sigismember(&current->blocked, signr)) {
specific_send_sig_info(signr, info, current);
signr = 0;
}
in the end of ptrace_signal(). Sure, we can add a local struct siginfo
in that if, fill it if needed and pass its address or &info->q->info
to specific_send_sig_info(), but that feels kludgy... Hell knows, I'll
look a bit more into that.

2014-02-14 00:09:12

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Thu, Feb 13, 2014 at 08:51:46PM +0000, Al Viro wrote:
> On Wed, Feb 12, 2014 at 09:44:11PM +0000, Al Viro wrote:
>
> > I'll try to put something along those lines together, if you or Oleg don't
> > do it first.
>
> OK, having looked at that stuff...
>
> 1) things become much more compact if we finish conversion to get_signal()
> first. Callers of get_signal_to_deliver() have k_sigaction and siginfo in
> pair of local variables; switching to ksignal will be neutral wrt stack
> footprint (it just gathers those two in one struct) *and* we are getting
> rid of passing struct siginfo * around. With that done, we can change
> struct ksignal ->info with zero impact on the code in arch/*, and conversion
> makes sense on its own. In the mainline we have it done for alpha, arm,
> openrisc, sparc and x86. I've just put together preliminary (and completely
> untested) patches for arm64, m68k and um; doing the rest won't take long, but
> they'll obviously need to be tested. It's a fairly safe conversion;
> I'd expect the worst bugs to be typos.

OK, there's a couple of tricks that allow to reorder that. First of all,
temporary config symbol (ARCH_USES_KSIGNAL) selecting that stuff; if it's
not selected, we just have #define small_siginfo siginfo and
#define assign_sigqueue(info, q) WARN_ON((q) != NULL)
and that's it - most of the changes in core kernel consist of s/siginfo/small_&/
and non-trivial ones are under ifdef CONFIG_ARCH_USES_KSIGNAL for the time
being. Small ifdefs, at that...

Once all architectures get converted, we'll just kill that config symbol and
make ifdefs unconditional.

Another is that we don't need to bother with task_work_add() at all; on
converted architectures we have signal_setup_done(..., ksig, ...) called after
each successful get_signal() (the first argument of signal_setup_done() is
"has sigframe setup failed" flag; we get there regardless of success or
failure of setup_...frame()). So we can just have that sucker do
assign_sigqueue(&ksig->info, NULL) and be done with that - all this delayed
freeing is explicit now (and we don't get locking overhead, etc. of
task_work_add()).

If what I've got survives the local beating, I'll put it into signal.git,
throw the arch conversions I could test in there and ask on linux-arch for
help with the missing ones. With any luck we'll get the full set by the
next merge window, at which point we'll be able to kill ifdefs.

2014-02-14 00:24:35

by Dave Chinner

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Wed, Feb 12, 2014 at 06:18:29PM +1100, Dave Chinner wrote:
> > It looks like just "do_signal()" has a stack frame that is about 230
> > bytes even under normal circumstancs (largely due to "struct ksignal"
> > - which in turn is largely due to the insane 128-byte padding in
> > siginfo_t). Add a few other frames in there, and I guess that if it
> > was close before, the coredump path just makes it go off.
>
> Yup. But it's when you see this sort of thing you realise that
> almost no GFP_KERNEL memory allocation is really safe from stack
> overruns, though:
>
> 0) 6064 64 update_group_power+0x2c/0x270
> 1) 6000 384 find_busiest_group+0x10a/0x8b0
> 2) 5616 288 load_balance+0x165/0x870
> 3) 5328 96 idle_balance+0x106/0x1b0
> 4) 5232 112 __schedule+0x7b6/0x7e0
> 5) 5120 16 schedule+0x29/0x70
> 6) 5104 176 percpu_ida_alloc+0x1b3/0x3d0
> 7) 4928 32 blk_mq_wait_for_tags+0x1f/0x40
> 8) 4896 80 blk_mq_alloc_request_pinned+0x4e/0x110
> 9) 4816 128 blk_mq_make_request+0x42b/0x510
> 10) 4688 48 generic_make_request+0xc0/0x110
> 11) 4640 96 submit_bio+0x71/0x120
> 12) 4544 192 _xfs_buf_ioapply+0x2cc/0x3b0
> 13) 4352 48 xfs_buf_iorequest+0x6f/0xc0
> 14) 4304 32 xlog_bdstrat+0x23/0x60
> 15) 4272 96 xlog_sync+0x3a4/0x5c0
> 16) 4176 48 xlog_state_release_iclog+0x121/0x130
> 17) 4128 192 xlog_write+0x700/0x7c0
> 18) 3936 192 xlog_cil_push+0x2ae/0x3d0
> 19) 3744 128 xlog_cil_force_lsn+0x1b8/0x1e0
> 20) 3616 144 _xfs_log_force_lsn+0x7c/0x300
> 21) 3472 48 xfs_log_force_lsn+0x3b/0xa0
> 22) 3424 112 xfs_iunpin_wait+0xd7/0x160
> 23) 3312 80 xfs_reclaim_inode+0xd0/0x350
> 24) 3232 432 xfs_reclaim_inodes_ag+0x277/0x3d0
> 25) 2800 48 xfs_reclaim_inodes_nr+0x33/0x40
> 26) 2752 16 xfs_fs_free_cached_objects+0x15/0x20
> 27) 2736 80 super_cache_scan+0x169/0x170
> 28) 2656 160 shrink_slab_node+0x133/0x290
> 29) 2496 80 shrink_slab+0x122/0x160
> 30) 2416 112 do_try_to_free_pages+0x1de/0x360
> 31) 2304 192 try_to_free_pages+0x110/0x190
> 32) 2112 256 __alloc_pages_nodemask+0x5a2/0x8e0
> 33) 1856 80 alloc_pages_current+0xb2/0x170
> 34) 1776 64 new_slab+0x265/0x2e0
> 35) 1712 240 __slab_alloc+0x2fb/0x4c4
> 36) 1472 80 kmem_cache_alloc+0x10b/0x140
>
> That's almost 4700 bytes of stack usage from the
> kmem_cache_alloc(GFP_KERNEL) call because it entered the IO path.
> Yes, it's an extreme case, but if you're looking for a smoking
> gun.... :/
>
> I can fix this one easily - we already have a workqueue for doing
> async log pushes (will split the stack between xlog_cil_force_lsn
> and xlog_cil_push), but the reason we haven't used it for synchronous
> log forces is that screws up fsync performance on CFQ. We don't
> recommend CFQ with XFS anyway, so I think I'll make this change
> anyway.

Dave, the patch below should chop off the stack usage from
xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
Can you given this a run?

Cheers,

Dave.
--
Dave Chinner
[email protected]

xfs: always do log forces via the workqueue

From: Dave Chinner <[email protected]>

Log forces can occur deep in the call chain when we have relatively
little stack free. Log forces can also happen at close to the call
chain leaves (e.g. xfs_buf_lock()) and hence we can trigger IO from
places where we really don't want to add more stack overhead.

This stack overhead occurs because log forces do foreground CIL
pushes (xlog_cil_push_foreground()) rather than waking the
background push wq and waiting for the for the push to complete.
This foreground push was done to avoid confusing the CFQ Io
scheduler when fsync()s were issued, as it has trouble dealing with
dependent IOs being issued from different process contexts.

Avoiding blowing the stack is much more critical than performance
optimisations for CFQ, especially as we've been recommending against
the use of CFQ for XFS since 3.2 kernels were release because of
it's problems with multi-threaded IO workloads.

Hence convert xlog_cil_push_foreground() to move the push work
to the CIL workqueue. We already do the waiting for the push to
complete in xlog_cil_force_lsn(), so there's nothing else we need to
modify to make this work.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/xfs_log_cil.c | 52 +++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 39 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b57a8e0..7e54553 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -499,13 +499,6 @@ xlog_cil_push(
cil->xc_ctx = new_ctx;

/*
- * mirror the new sequence into the cil structure so that we can do
- * unlocked checks against the current sequence in log forces without
- * risking deferencing a freed context pointer.
- */
- cil->xc_current_sequence = new_ctx->sequence;
-
- /*
* The switch is now done, so we can drop the context lock and move out
* of a shared context. We can't just go straight to the commit record,
* though - we need to synchronise with previous and future commits so
@@ -523,8 +516,15 @@ xlog_cil_push(
* Hence we need to add this context to the committing context list so
* that higher sequences will wait for us to write out a commit record
* before they do.
+ *
+ * xfs_log_force_lsn requires us to mirror the new sequence into the cil
+ * structure atomically with the addition of this sequence to the
+ * committing list. This also ensures that we can do unlocked checks
+ * against the current sequence in log forces without risking
+ * deferencing a freed context pointer.
*/
spin_lock(&cil->xc_push_lock);
+ cil->xc_current_sequence = new_ctx->sequence;
list_add(&ctx->committing, &cil->xc_committing);
spin_unlock(&cil->xc_push_lock);
up_write(&cil->xc_ctx_lock);
@@ -662,8 +662,14 @@ xlog_cil_push_background(

}

+/*
+ * xlog_cil_push_now() is used to trigger an immediate CIL push to the sequence
+ * number that is passed. When it returns, the work will be queued for
+ * @push_seq, but it won't be completed. The caller is expected to do any
+ * waiting for push_seq to complete if it is required.
+ */
static void
-xlog_cil_push_foreground(
+xlog_cil_push_now(
struct xlog *log,
xfs_lsn_t push_seq)
{
@@ -688,10 +694,8 @@ xlog_cil_push_foreground(
}

cil->xc_push_seq = push_seq;
+ queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
spin_unlock(&cil->xc_push_lock);
-
- /* do the push now */
- xlog_cil_push(log);
}

bool
@@ -795,7 +799,8 @@ xlog_cil_force_lsn(
* xlog_cil_push() handles racing pushes for the same sequence,
* so no need to deal with it here.
*/
- xlog_cil_push_foreground(log, sequence);
+restart:
+ xlog_cil_push_now(log, sequence);

/*
* See if we can find a previous sequence still committing.
@@ -803,7 +808,6 @@ xlog_cil_force_lsn(
* before allowing the force of push_seq to go ahead. Hence block
* on commits for those as well.
*/
-restart:
spin_lock(&cil->xc_push_lock);
list_for_each_entry(ctx, &cil->xc_committing, committing) {
if (ctx->sequence > sequence)
@@ -821,6 +825,28 @@ restart:
/* found it! */
commit_lsn = ctx->commit_lsn;
}
+
+ /*
+ * The call to xlog_cil_push_now() executes the push in the background.
+ * Hence by the time we have got here it our sequence may not have been
+ * pushed yet. This is true if the current sequence still matches the
+ * push sequence after the above wait loop and the CIL still contains
+ * dirty objects.
+ *
+ * When the push occurs, it will empty the CIL and
+ * atomically increment the currect sequence past the push sequence and
+ * move it into the committing list. Of course, if the CIL is clean at
+ * the time of the push, it won't have pushed the CIL at all, so in that
+ * case we should try the push for this sequence again from the start
+ * just in case.
+ */
+
+ if (sequence == cil->xc_current_sequence &&
+ !list_empty(&cil->xc_cil)) {
+ spin_unlock(&cil->xc_push_lock);
+ goto restart;
+ }
+
spin_unlock(&cil->xc_push_lock);
return commit_lsn;
}

2014-02-14 13:25:14

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Thu, Feb 13, 2014 at 08:51:46PM +0000, Al Viro wrote:
> On Wed, Feb 12, 2014 at 09:44:11PM +0000, Al Viro wrote:
>
> > I'll try to put something along those lines together, if you or Oleg don't
> > do it first.
>
> OK, having looked at that stuff...
>
> 1) things become much more compact if we finish conversion to get_signal()
> first.

I have vague memories that Richard sent out a series to convert over all
architectures a while ago. Hopefully he has better memory than I do.

2014-02-14 13:29:41

by Richard Weinberger

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

Am 14.02.2014 14:25, schrieb Christoph Hellwig:
> On Thu, Feb 13, 2014 at 08:51:46PM +0000, Al Viro wrote:
>> On Wed, Feb 12, 2014 at 09:44:11PM +0000, Al Viro wrote:
>>
>>> I'll try to put something along those lines together, if you or Oleg don't
>>> do it first.
>>
>> OK, having looked at that stuff...
>>
>> 1) things become much more compact if we finish conversion to get_signal()
>> first.
>
> I have vague memories that Richard sent out a series to convert over all
> architectures a while ago. Hopefully he has better memory than I do.

Yeah. Sending v2 of that series is on my overflowing TODO list. :-\
I think this is a good reason for me to start working on that series again.
Stay tuned.

Thanks,
//richard

2014-02-14 15:20:50

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Fri, Feb 14, 2014 at 02:29:30PM +0100, Richard Weinberger wrote:
> Am 14.02.2014 14:25, schrieb Christoph Hellwig:
> > On Thu, Feb 13, 2014 at 08:51:46PM +0000, Al Viro wrote:
> >> On Wed, Feb 12, 2014 at 09:44:11PM +0000, Al Viro wrote:
> >>
> >>> I'll try to put something along those lines together, if you or Oleg don't
> >>> do it first.
> >>
> >> OK, having looked at that stuff...
> >>
> >> 1) things become much more compact if we finish conversion to get_signal()
> >> first.
> >
> > I have vague memories that Richard sent out a series to convert over all
> > architectures a while ago. Hopefully he has better memory than I do.
>
> Yeah. Sending v2 of that series is on my overflowing TODO list. :-\
> I think this is a good reason for me to start working on that series again.
> Stay tuned.

Would be great. I have several done here, but I'll be glad to replace them
with something tested...

BTW, Oleg, could you explain why does PTRACE_PEEK_SIGINFO copy ->si_code
separately? IOW, why do we want the upper 16 bits of ->si_code exposed?
It used to be a strictly internal thing IIRC (it's been what, 2.3.late?)

2014-02-14 16:01:42

by Dave Jones

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Fri, Feb 14, 2014 at 11:24:27AM +1100, Dave Chinner wrote:

> > I can fix this one easily - we already have a workqueue for doing
> > async log pushes (will split the stack between xlog_cil_force_lsn
> > and xlog_cil_push), but the reason we haven't used it for synchronous
> > log forces is that screws up fsync performance on CFQ. We don't
> > recommend CFQ with XFS anyway, so I think I'll make this change
> > anyway.
>
> Dave, the patch below should chop off the stack usage from
> xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
> Can you given this a run?

Looks like it's survived an overnight run..

Dave

2014-02-14 16:09:44

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On 02/14, Al Viro wrote:
>
> BTW, Oleg, could you explain why does PTRACE_PEEK_SIGINFO copy ->si_code
> separately? IOW, why do we want the upper 16 bits of ->si_code exposed?
> It used to be a strictly internal thing IIRC (it's been what, 2.3.late?)

Yes, but checkpoint/restart tools want to dump/restore the internal part
of ->si_code too.

See also the "task_pid_vnr(current) == pid" hack in do_rt_sigqueueinfo(),
this allows to queue a SI_FROMKERNEL() siginfo.

Oleg.

2014-02-14 16:13:09

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Wed, Feb 12, 2014 at 01:32:55PM -0800, Linus Torvalds wrote:
> We'd have to teach each user of "dequeue_signal()" to free the siginfo
> thing. Which shouldn't be too bad - I think we've collected all of
> that into generic code, and there isn't the mass or architecture code
> that knows about these things any more. But there are a few odd
> drivers etc and signalfd.

The few odd drivers are nbd, jffs2 and the usb mass storage gadget.
All of these have in common that they try to handle signals in a kernel
thread (which we don't even allow by default), and that they ignore the
siginfo. I think they could mostly be replaced by an addition to the
kthread API to allow a kthread to be killed by signals for legacy
reasons.

2014-02-14 16:16:34

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Fri, Feb 14, 2014 at 08:13:02AM -0800, Christoph Hellwig wrote:
> On Wed, Feb 12, 2014 at 01:32:55PM -0800, Linus Torvalds wrote:
> > We'd have to teach each user of "dequeue_signal()" to free the siginfo
> > thing. Which shouldn't be too bad - I think we've collected all of
> > that into generic code, and there isn't the mass or architecture code
> > that knows about these things any more. But there are a few odd
> > drivers etc and signalfd.
>
> The few odd drivers are nbd, jffs2 and the usb mass storage gadget.
> All of these have in common that they try to handle signals in a kernel
> thread (which we don't even allow by default), and that they ignore the
> siginfo. I think they could mostly be replaced by an addition to the
> kthread API to allow a kthread to be killed by signals for legacy
> reasons.

FWIW, there's a funny situation - all users of dequeue_signal_lock()
actually ignore info completely. I'm not saying that we ought to
stop returning it, but e.g. jbd part of that patch is simply
diff --git a/fs/jffs2/background.c b/fs/jffs2/background.c
index 2b60ce1..aefdff2 100644
--- a/fs/jffs2/background.c
+++ b/fs/jffs2/background.c
@@ -119,13 +119,14 @@ static int jffs2_garbage_collect_thread(void *_c)
/* Put_super will send a SIGKILL and then wait on the sem.
*/
while (signal_pending(current) || freezing(current)) {
- siginfo_t info;
+ ksiginfo_t info;
unsigned long signr;

if (try_to_freeze())
goto again;

signr = dequeue_signal_lock(current, &current->blocked, &info);
+ dismiss_siginfo(&info);

switch(signr) {
case SIGSTOP:

Not complicated at all. Where it does get complicated is ->last_siginfo and
PTRACE_SETSIGINFO - getting that reasonably clean is what I'm still fighting
right now...

2014-02-14 16:18:07

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Fri, Feb 14, 2014 at 04:16:24PM +0000, Al Viro wrote:
> FWIW, there's a funny situation - all users of dequeue_signal_lock()
> actually ignore info completely. I'm not saying that we ought to
> stop returning it, but e.g. jbd part of that patch is simply

s/jbd/jffs2/, obviously. Sorry... And yes, nbd and usb_storage are the
same story.

2014-02-14 16:19:26

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Fri, Feb 14, 2014 at 04:16:24PM +0000, Al Viro wrote:
> > All of these have in common that they try to handle signals in a kernel
> > thread (which we don't even allow by default), and that they ignore the
> > siginfo. I think they could mostly be replaced by an addition to the
> > kthread API to allow a kthread to be killed by signals for legacy
> > reasons.
>
> FWIW, there's a funny situation - all users of dequeue_signal_lock()
> actually ignore info completely. I'm not saying that we ought to
> stop returning it, but e.g. jbd part of that patch is simply

Might aswell stick the discmiss into what was dequeue_signal_lock().
Which at that point should get a saner name (maybe thread_dequeue_signal ?)
and lose all argument except maybe task_struct - not that it's
nessecary, but it would mirror the other functions usually used around
it.

2014-02-15 05:25:42

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Thu, Feb 13, 2014 at 09:58:47AM -0800, Linus Torvalds wrote:
> On Thu, Feb 13, 2014 at 9:40 AM, Oleg Nesterov <[email protected]> wrote:
> >
> > And we should be careful with SIGQUEUE_PREALLOC, at least
> > collect_signal() should not do list_del_init()... Plus we need to
> > handle the SEND_SIG_FORCED-like case.
>
> I don't think the users need to care. They'd just call
> "sigqueue_free()" not knowing about our preallocations etc. That kind
> of detail should be confined to inside signal.c.
>
> But there really aren't that many users. There's a couple of
> "dequeue_signal_lock()" users, but they don't actually *want* the
> siginfo at all (they're kernel threads), so we can just make that
> function free the siginfo immediately (and get rid of the totally
> unnecessay kernel stack allocation). And outside of signal.c only
> signalfd uses "dequeue_signal()" itself, and that would be the only
> one that would need to be taught to use (in signalfd_copyinfo()) and
> then free the sigqueue entry.
>
> So it really looks like the right thing to do, and fairly
> straightforward. But I'm leaving the coding proof to Al, since he
> already offered ;)

OK, _very_ preliminary patch follows. It's uglier than it has to and it's
not in the form I would consider suitable for merge. It doesn't depend
on conversions of architectures to use of ksignal as prereqs; instead of
that it has config symbol (ARCH_USES_KSIGNAL) that can be selected on
architectures already through that conversion (in this patch - just on
x86, since that was all I could test on at the moment; e.g. alpha and
arm could also get such selects right now). If it isn't selected, everything
builds as usual. With large siginfo on stack for signal path, as before.
We get ksiginfo_t as an alias for siginfo_t in that case, dismiss_siginfo()
is a no-op, etc. If it _is_ selected, ksiginfo_t is much smaller than
siginfo_t.

Of course, we pay for that with ifdefs in places that need to understand
[k]siginfo_t guts; signalfd_copyinfo(), copy_siginfo_from_user{,32}() and
copy_ksiginfo_to_user{,32}(), tracepoint for deliver_signal(), collect_signal(),
trivial ones in dequeue_signal() and an ugly bugger in ptrace_signal().
No way around that until all architectures are converted. Other places
get away with some siginfo_t replaced with ksiginfo_t and dismiss_siginfo()
added in some places. arch/x86/ia32/ia32_signal.c needed a bit of change
(we probably ought to unify copy_siginfo_from_user32() on all architectures
that have it, actually).

The sucker survived LTP syscall tests and hasn't died on xfstests yet either
(== doesn't seem to leak horribly). Consider it a proof-of-concept and no
more than that. One thing I'm really not satisfied with is signal sending
pathway - it would probably make a lot of sense to use ksiginfo there as
well, and have insertion into queue simply steal info->q instead of allocating
and copying. Another is that ifdefs like that are tolerable only if we
plan to convert everything during this cycle; I believe it to be doable,
actually. Having that sucker go before the conversions would make for
simpler logistics wrt architecture trees, but it may be better to get
all conversions merged first. Not sure... Anyway, here is it; comments
are welcome.

Signed-off-by: Al Viro <[email protected]>
---
diff --git a/arch/Kconfig b/arch/Kconfig
index 80bbb8c..f753561 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -522,4 +522,7 @@ config OLD_SIGACTION
config COMPAT_OLD_SIGACTION
bool

+config ARCH_USES_KSIGNAL
+ bool
+
source "kernel/gcov/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0af5250..4ec468d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -127,6 +127,7 @@ config X86
select HAVE_DEBUG_STACKOVERFLOW
select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_64
select HAVE_CC_STACKPROTECTOR
+ select ARCH_USES_KSIGNAL

config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index 2206757..611a03e 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -34,7 +34,7 @@
#include <asm/sys_ia32.h>
#include <asm/smap.h>

-int copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t *from)
+int __copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t *from)
{
int err = 0;
bool ia32 = test_thread_flag(TIF_IA32);
@@ -105,6 +105,25 @@ int copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t *from)
return err;
}

+int copy_siginfo_to_user32(compat_siginfo_t __user *to, const ksiginfo_t *from)
+{
+ int err = 0;
+ if (!from->q) {
+ if (!access_ok(VERIFY_WRITE, to, sizeof(compat_siginfo_t)))
+ return -EFAULT;
+ put_user_try {
+ /* fast case */
+ put_user_ex(from->si_signo, &to->si_signo);
+ put_user_ex(0, &to->si_errno);
+ put_user_ex((short)from->si_code, &to->si_code);
+ put_user_ex(from->si_pid, &to->si_pid);
+ put_user_ex(from->si_uid, &to->si_uid);
+ } put_user_catch(err);
+ return err;
+ }
+ return __copy_siginfo_to_user32(to, &from->q->info);
+}
+
int copy_siginfo_from_user32(siginfo_t *to, compat_siginfo_t __user *from)
{
int err = 0;
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 55298db..673ab73 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -201,10 +201,11 @@ static int sock_xmit(struct nbd_device *nbd, int send, void *buf, int size,
msg.msg_flags);

if (signal_pending(current)) {
- siginfo_t info;
+ ksiginfo_t info;
printk(KERN_WARNING "nbd (pid %d: %s) got signal %d\n",
task_pid_nr(current), current->comm,
dequeue_signal_lock(current, &current->blocked, &info));
+ dismiss_siginfo(&info);
result = -EINTR;
sock_shutdown(nbd, !send);
break;
diff --git a/drivers/usb/gadget/f_mass_storage.c b/drivers/usb/gadget/f_mass_storage.c
index b963939..ecacac5 100644
--- a/drivers/usb/gadget/f_mass_storage.c
+++ b/drivers/usb/gadget/f_mass_storage.c
@@ -2337,7 +2337,7 @@ static void fsg_disable(struct usb_function *f)

static void handle_exception(struct fsg_common *common)
{
- siginfo_t info;
+ ksiginfo_t info;
int i;
struct fsg_buffhd *bh;
enum fsg_state old_state;
@@ -2351,6 +2351,7 @@ static void handle_exception(struct fsg_common *common)
for (;;) {
int sig =
dequeue_signal_lock(current, &current->blocked, &info);
+ dismiss_siginfo(&info);
if (!sig)
break;
if (sig != SIGUSR1) {
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 67be295..e79a23e 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1371,7 +1371,7 @@ static void fill_auxv_note(struct memelfnote *note, struct mm_struct *mm)
}

static void fill_siginfo_note(struct memelfnote *note, user_siginfo_t *csigdata,
- const siginfo_t *siginfo)
+ const ksiginfo_t *siginfo)
{
mm_segment_t old_fs = get_fs();
set_fs(KERNEL_DS);
@@ -1578,7 +1578,7 @@ static int fill_thread_core_info(struct elf_thread_core_info *t,

static int fill_note_info(struct elfhdr *elf, int phdrs,
struct elf_note_info *info,
- const siginfo_t *siginfo, struct pt_regs *regs)
+ const ksiginfo_t *siginfo, struct pt_regs *regs)
{
struct task_struct *dump_task = current;
const struct user_regset_view *view = task_user_regset_view(dump_task);
@@ -1827,7 +1827,7 @@ static int elf_note_info_init(struct elf_note_info *info)

static int fill_note_info(struct elfhdr *elf, int phdrs,
struct elf_note_info *info,
- const siginfo_t *siginfo, struct pt_regs *regs)
+ const ksiginfo_t *siginfo, struct pt_regs *regs)
{
struct list_head *t;
struct core_thread *ct;
diff --git a/fs/coredump.c b/fs/coredump.c
index e3ad709..44ff98d 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -484,7 +484,7 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
return err;
}

-void do_coredump(const siginfo_t *siginfo)
+void do_coredump(const ksiginfo_t *siginfo)
{
struct core_state core_state;
struct core_name cn;
diff --git a/fs/jffs2/background.c b/fs/jffs2/background.c
index 2b60ce1..aefdff2 100644
--- a/fs/jffs2/background.c
+++ b/fs/jffs2/background.c
@@ -119,13 +119,14 @@ static int jffs2_garbage_collect_thread(void *_c)
/* Put_super will send a SIGKILL and then wait on the sem.
*/
while (signal_pending(current) || freezing(current)) {
- siginfo_t info;
+ ksiginfo_t info;
unsigned long signr;

if (try_to_freeze())
goto again;

signr = dequeue_signal_lock(current, &current->blocked, &info);
+ dismiss_siginfo(&info);

switch(signr) {
case SIGSTOP:
diff --git a/fs/signalfd.c b/fs/signalfd.c
index 424b7b6..d6f3c47 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -78,8 +78,9 @@ static unsigned int signalfd_poll(struct file *file, poll_table *wait)
* Copied from copy_siginfo_to_user() in kernel/signal.c
*/
static int signalfd_copyinfo(struct signalfd_siginfo __user *uinfo,
- siginfo_t const *kinfo)
+ const ksiginfo_t *from)
{
+ const siginfo_t *kinfo;
long err;

BUILD_BUG_ON(sizeof(struct signalfd_siginfo) != 128);
@@ -89,6 +90,19 @@ static int signalfd_copyinfo(struct signalfd_siginfo __user *uinfo,
*/
err = __clear_user(uinfo, sizeof(*uinfo));

+#ifdef CONFIG_ARCH_USES_KSIGNAL
+ if (!from->q) {
+ err |= __put_user(from->si_signo, &uinfo->ssi_signo);
+ err |= __put_user(0, &uinfo->ssi_errno);
+ err |= __put_user((short) from->si_code, &uinfo->ssi_code);
+ err |= __put_user(from->si_pid, &uinfo->ssi_pid);
+ err |= __put_user(from->si_uid, &uinfo->ssi_uid);
+ goto out;
+ }
+ kinfo = &from->q->info;
+#else
+ kinfo = from;
+#endif
/*
* If you change siginfo_t structure, please be sure
* this code is fixed accordingly.
@@ -152,10 +166,11 @@ static int signalfd_copyinfo(struct signalfd_siginfo __user *uinfo,
break;
}

+out:
return err ? -EFAULT: sizeof(*uinfo);
}

-static ssize_t signalfd_dequeue(struct signalfd_ctx *ctx, siginfo_t *info,
+static ssize_t signalfd_dequeue(struct signalfd_ctx *ctx, ksiginfo_t *info,
int nonblock)
{
ssize_t ret;
@@ -207,7 +222,7 @@ static ssize_t signalfd_read(struct file *file, char __user *buf, size_t count,
struct signalfd_siginfo __user *siginfo;
int nonblock = file->f_flags & O_NONBLOCK;
ssize_t ret, total = 0;
- siginfo_t info;
+ ksiginfo_t info;

count /= sizeof(struct signalfd_siginfo);
if (!count)
@@ -219,6 +234,7 @@ static ssize_t signalfd_read(struct file *file, char __user *buf, size_t count,
if (unlikely(ret <= 0))
break;
ret = signalfd_copyinfo(siginfo, &info);
+ dismiss_siginfo(&info);
if (ret < 0)
break;
siginfo++;
diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
index 3d1a3af..43a6895 100644
--- a/include/asm-generic/siginfo.h
+++ b/include/asm-generic/siginfo.h
@@ -15,6 +15,11 @@
#define __SI_CODE(T,N) ((T) | ((N) & 0xffff))

struct siginfo;
+#ifndef CONFIG_ARCH_USES_KSIGNAL
+#define small_siginfo siginfo
+#else
+struct small_siginfo;
+#endif
void do_schedule_next_timer(struct siginfo *info);

#ifndef HAVE_ARCH_COPY_SIGINFO
@@ -32,6 +37,8 @@ static inline void copy_siginfo(struct siginfo *to, struct siginfo *from)

#endif

-extern int copy_siginfo_to_user(struct siginfo __user *to, const struct siginfo *from);
+extern int copy_ksiginfo_from_user(struct small_siginfo *to, const struct siginfo __user *from);
+extern int copy_siginfo_to_user(struct siginfo __user *to, const struct small_siginfo *from);
+extern int __copy_siginfo_to_user(struct siginfo __user *to, const struct siginfo *from);

#endif
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index b4a745d..912d8d4 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -56,7 +56,7 @@ struct linux_binprm {

/* Function parameter for binfmt->coredump */
struct coredump_params {
- const siginfo_t *siginfo;
+ const ksiginfo_t *siginfo;
struct pt_regs *regs;
struct file *file;
unsigned long limit;
diff --git a/include/linux/compat.h b/include/linux/compat.h
index 3f448c6..9ab31c9 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -362,7 +362,9 @@ long compat_get_bitmap(unsigned long *mask, const compat_ulong_t __user *umask,
long compat_put_bitmap(compat_ulong_t __user *umask, unsigned long *mask,
unsigned long bitmap_size);
int copy_siginfo_from_user32(siginfo_t *to, struct compat_siginfo __user *from);
-int copy_siginfo_to_user32(struct compat_siginfo __user *to, const siginfo_t *from);
+int copy_ksiginfo_from_user32(ksiginfo_t *to, struct compat_siginfo __user *from);
+int copy_siginfo_to_user32(struct compat_siginfo __user *to, const ksiginfo_t *from);
+int __copy_siginfo_to_user32(struct compat_siginfo __user *to, const siginfo_t *from);
int get_compat_sigevent(struct sigevent *event,
const struct compat_sigevent __user *u_event);
long compat_sys_rt_tgsigqueueinfo(compat_pid_t tgid, compat_pid_t pid, int sig,
diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index d016a12..3c9f79c 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -15,9 +15,9 @@ extern int dump_skip(struct coredump_params *cprm, size_t nr);
extern int dump_emit(struct coredump_params *cprm, const void *addr, int nr);
extern int dump_align(struct coredump_params *cprm, int align);
#ifdef CONFIG_COREDUMP
-extern void do_coredump(const siginfo_t *siginfo);
+extern void do_coredump(const ksiginfo_t *siginfo);
#else
-static inline void do_coredump(const siginfo_t *siginfo) {}
+static inline void do_coredump(const ksiginfo_t *siginfo) {}
#endif

#endif /* _LINUX_COREDUMP_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a781dec..60834bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1428,7 +1428,7 @@ struct task_struct {
struct io_context *io_context;

unsigned long ptrace_message;
- siginfo_t *last_siginfo; /* For ptrace use. */
+ ksiginfo_t *last_siginfo; /* For ptrace use. */
struct task_io_accounting ioac;
#if defined(CONFIG_TASK_XACCT)
u64 acct_rss_mem1; /* accumulated rss usage */
@@ -2177,9 +2177,9 @@ extern void flush_signals(struct task_struct *);
extern void __flush_signals(struct task_struct *);
extern void ignore_signals(struct task_struct *);
extern void flush_signal_handlers(struct task_struct *, int force_default);
-extern int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info);
+extern int dequeue_signal(struct task_struct *tsk, sigset_t *mask, ksiginfo_t *info);

-static inline int dequeue_signal_lock(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
+static inline int dequeue_signal_lock(struct task_struct *tsk, sigset_t *mask, ksiginfo_t *info)
{
unsigned long flags;
int ret;
@@ -2204,7 +2204,6 @@ extern int kill_pid_info_as_cred(int, struct siginfo *, struct pid *,
const struct cred *, u32);
extern int kill_pgrp(struct pid *pid, int sig, int priv);
extern int kill_pid(struct pid *pid, int sig, int priv);
-extern int kill_proc_info(int, struct siginfo *, pid_t);
extern __must_check bool do_notify_parent(struct task_struct *, int);
extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent);
extern void force_sig(int, struct task_struct *);
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 2ac423b..c999b4a 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -236,12 +236,29 @@ static inline int valid_signal(unsigned long sig)
struct timespec;
struct pt_regs;

+#ifdef CONFIG_ARCH_USES_KSIGNAL
+typedef struct small_siginfo {
+ int si_signo;
+ int si_code;
+ union { struct { /* yecch... */
+ __kernel_pid_t _pid;
+ __ARCH_SI_UID_T _uid;
+ } _kill; } _sifields;
+ struct sigqueue *q;
+} ksiginfo_t;
+extern void dismiss_siginfo(ksiginfo_t *);
+#else
+#define small_siginfo siginfo
+#define ksiginfo_t siginfo_t
+#define dismiss_siginfo(info) ((void)0)
+#endif
+
extern int next_signal(struct sigpending *pending, sigset_t *mask);
extern int do_send_sig_info(int sig, struct siginfo *info,
struct task_struct *p, bool group);
extern int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p);
extern int __group_send_sig_info(int, struct siginfo *, struct task_struct *);
-extern int do_sigtimedwait(const sigset_t *, siginfo_t *,
+extern int do_sigtimedwait(const sigset_t *, ksiginfo_t *,
const struct timespec *);
extern int sigprocmask(int, sigset_t *, sigset_t *);
extern void set_current_blocked(sigset_t *);
@@ -281,13 +298,13 @@ struct old_sigaction {

struct ksignal {
struct k_sigaction ka;
- siginfo_t info;
+ ksiginfo_t info;
int sig;
};

-extern int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka, struct pt_regs *regs, void *cookie);
+extern int get_signal_to_deliver(ksiginfo_t *info, struct k_sigaction *return_ka, struct pt_regs *regs, void *cookie);
extern void signal_setup_done(int failed, struct ksignal *ksig, int stepping);
-extern void signal_delivered(int sig, siginfo_t *info, struct k_sigaction *ka, struct pt_regs *regs, int stepping);
+extern void signal_delivered(int sig, ksiginfo_t *info, struct k_sigaction *ka, struct pt_regs *regs, int stepping);
extern void exit_signals(struct task_struct *tsk);

/*
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 1e98b55..c9e7a0e 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -146,7 +146,7 @@ static inline void tracehook_report_syscall_exit(struct pt_regs *regs, int step)
* Called without locks, shortly before returning to user mode
* (or handling more signals).
*/
-static inline void tracehook_signal_handler(int sig, siginfo_t *info,
+static inline void tracehook_signal_handler(int sig, ksiginfo_t *info,
const struct k_sigaction *ka,
struct pt_regs *regs, int stepping)
{
diff --git a/include/trace/events/signal.h b/include/trace/events/signal.h
index 39a8a43..1494a3e 100644
--- a/include/trace/events/signal.h
+++ b/include/trace/events/signal.h
@@ -23,6 +23,27 @@
} \
} while (0)

+#ifndef CONFIG_ARCH_USES_KSIGNAL
+#define TP_SI_ERRNO(info) info->si_errno
+#else
+#define TP_SI_ERRNO(info) info->q ? info->q->info.si_errno : 0
+#endif
+
+#define TP_STORE_KSIGINFO(__entry, info) \
+ do { \
+ if (info == (ksiginfo_t *)SEND_SIG_NOINFO || \
+ info == (ksiginfo_t *)SEND_SIG_FORCED) { \
+ __entry->errno = 0; \
+ __entry->code = SI_USER; \
+ } else if (info == (ksiginfo_t *)SEND_SIG_PRIV) {\
+ __entry->errno = 0; \
+ __entry->code = SI_KERNEL; \
+ } else { \
+ __entry->errno = TP_SI_ERRNO(info); \
+ __entry->code = info->si_code; \
+ } \
+ } while (0)
+
#ifndef TRACE_HEADER_MULTI_READ
enum {
TRACE_SIGNAL_DELIVERED,
@@ -95,7 +116,7 @@ TRACE_EVENT(signal_generate,
*/
TRACE_EVENT(signal_deliver,

- TP_PROTO(int sig, struct siginfo *info, struct k_sigaction *ka),
+ TP_PROTO(int sig, ksiginfo_t *info, struct k_sigaction *ka),

TP_ARGS(sig, info, ka),

@@ -109,7 +130,7 @@ TRACE_EVENT(signal_deliver,

TP_fast_assign(
__entry->sig = sig;
- TP_STORE_SIGINFO(__entry, info);
+ TP_STORE_KSIGINFO(__entry, info);
__entry->sa_handler = (unsigned long)ka->sa.sa_handler;
__entry->sa_flags = ka->sa.sa_flags;
),
diff --git a/kernel/compat.c b/kernel/compat.c
index 0a09e48..82084b4 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -579,7 +579,7 @@ COMPAT_SYSCALL_DEFINE5(waitid,

BUG_ON(info.si_code & __SI_MASK);
info.si_code |= __SI_CHLD;
- return copy_siginfo_to_user32(uinfo, &info);
+ return __copy_siginfo_to_user32(uinfo, &info);
}

static int compat_get_user_cpu_mask(compat_ulong_t __user *user_mask_ptr,
@@ -981,7 +981,7 @@ COMPAT_SYSCALL_DEFINE4(rt_sigtimedwait, compat_sigset_t __user *, uthese,
compat_sigset_t s32;
sigset_t s;
struct timespec t;
- siginfo_t info;
+ ksiginfo_t info;
long ret;

if (sigsetsize != sizeof(sigset_t))
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 1f4bcb3..11223d4 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -589,7 +589,7 @@ static int ptrace_setoptions(struct task_struct *child, unsigned long data)
return 0;
}

-static int ptrace_getsiginfo(struct task_struct *child, siginfo_t *info)
+static int ptrace_getsiginfo(struct task_struct *child, ksiginfo_t *info)
{
unsigned long flags;
int error = -ESRCH;
@@ -605,7 +605,7 @@ static int ptrace_getsiginfo(struct task_struct *child, siginfo_t *info)
return error;
}

-static int ptrace_setsiginfo(struct task_struct *child, const siginfo_t *info)
+static int ptrace_setsiginfo(struct task_struct *child, ksiginfo_t *info)
{
unsigned long flags;
int error = -ESRCH;
@@ -613,11 +613,14 @@ static int ptrace_setsiginfo(struct task_struct *child, const siginfo_t *info)
if (lock_task_sighand(child, &flags)) {
error = -EINVAL;
if (likely(child->last_siginfo != NULL)) {
+ dismiss_siginfo(child->last_siginfo);
*child->last_siginfo = *info;
error = 0;
}
unlock_task_sighand(child, &flags);
}
+ if (error)
+ dismiss_siginfo(info);
return error;
}

@@ -666,7 +669,7 @@ static int ptrace_peek_siginfo(struct task_struct *child,
if (unlikely(is_compat_task())) {
compat_siginfo_t __user *uinfo = compat_ptr(data);

- if (copy_siginfo_to_user32(uinfo, &info) ||
+ if (__copy_siginfo_to_user32(uinfo, &info) ||
__put_user(info.si_code, &uinfo->si_code)) {
ret = -EFAULT;
break;
@@ -677,7 +680,7 @@ static int ptrace_peek_siginfo(struct task_struct *child,
{
siginfo_t __user *uinfo = (siginfo_t __user *) data;

- if (copy_siginfo_to_user(uinfo, &info) ||
+ if (__copy_siginfo_to_user(uinfo, &info) ||
__put_user(info.si_code, &uinfo->si_code)) {
ret = -EFAULT;
break;
@@ -805,7 +808,7 @@ int ptrace_request(struct task_struct *child, long request,
{
bool seized = child->ptrace & PT_SEIZED;
int ret = -EIO;
- siginfo_t siginfo, *si;
+ ksiginfo_t siginfo, *si;
void __user *datavp = (void __user *) data;
unsigned long __user *datalp = datavp;
unsigned long flags;
@@ -839,9 +842,8 @@ int ptrace_request(struct task_struct *child, long request,
break;

case PTRACE_SETSIGINFO:
- if (copy_from_user(&siginfo, datavp, sizeof siginfo))
- ret = -EFAULT;
- else
+ ret = copy_ksiginfo_from_user(&siginfo, datavp);
+ if (!ret)
ret = ptrace_setsiginfo(child, &siginfo);
break;

@@ -1107,7 +1109,7 @@ int compat_ptrace_request(struct task_struct *child, compat_long_t request,
{
compat_ulong_t __user *datap = compat_ptr(data);
compat_ulong_t word;
- siginfo_t siginfo;
+ ksiginfo_t siginfo;
int ret;

switch (request) {
@@ -1139,8 +1141,7 @@ int compat_ptrace_request(struct task_struct *child, compat_long_t request,
break;

case PTRACE_SETSIGINFO:
- memset(&siginfo, 0, sizeof siginfo);
- if (copy_siginfo_from_user32(
+ if (copy_ksiginfo_from_user32(
&siginfo, (struct compat_siginfo __user *) datap))
ret = -EFAULT;
else
diff --git a/kernel/signal.c b/kernel/signal.c
index 48cf9fc..2b641aa 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -540,7 +540,18 @@ unblock_all_signals(void)
spin_unlock_irqrestore(&current->sighand->siglock, flags);
}

-static void collect_signal(int sig, struct sigpending *list, siginfo_t *info)
+#ifdef CONFIG_ARCH_USES_KSIGNAL
+void dismiss_siginfo(ksiginfo_t *info)
+{
+ if (info->q)
+ __sigqueue_free(info->q);
+ info->q = NULL;
+}
+EXPORT_SYMBOL(dismiss_siginfo);
+#endif
+
+static void collect_signal(int sig, struct sigpending *list,
+ ksiginfo_t *info)
{
struct sigqueue *q, *first = NULL;

@@ -561,8 +572,14 @@ static void collect_signal(int sig, struct sigpending *list, siginfo_t *info)
if (first) {
still_pending:
list_del_init(&first->list);
+#ifndef CONFIG_ARCH_USES_KSIGNAL
copy_siginfo(info, &first->info);
__sigqueue_free(first);
+#else
+ info->si_signo = sig;
+ info->si_code = first->info.si_code;
+ info->q = first;
+#endif
} else {
/*
* Ok, it wasn't in the queue. This must be
@@ -570,15 +587,19 @@ still_pending:
* out of queue space. So zero out the info.
*/
info->si_signo = sig;
- info->si_errno = 0;
info->si_code = SI_USER;
info->si_pid = 0;
info->si_uid = 0;
+#ifndef CONFIG_ARCH_USES_KSIGNAL
+ info->si_errno = 0;
+#else
+ info->q = NULL;
+#endif
}
}

static int __dequeue_signal(struct sigpending *pending, sigset_t *mask,
- siginfo_t *info)
+ ksiginfo_t *info)
{
int sig = next_signal(pending, mask);

@@ -604,7 +625,7 @@ static int __dequeue_signal(struct sigpending *pending, sigset_t *mask,
*
* All callers have to hold the siglock.
*/
-int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
+int dequeue_signal(struct task_struct *tsk, sigset_t *mask, ksiginfo_t *info)
{
int signr;

@@ -659,7 +680,12 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
*/
current->jobctl |= JOBCTL_STOP_DEQUEUED;
}
+#ifndef CONFIG_ARCH_USES_KSIGNAL
if ((info->si_code & __SI_MASK) == __SI_TIMER && info->si_sys_private) {
+#else
+ if ((info->si_code & __SI_MASK) == __SI_TIMER &&
+ info->q->info.si_sys_private) {
+#endif
/*
* Release the siglock to ensure proper locking order
* of timer locks outside of siglocks. Note, we leave
@@ -667,7 +693,11 @@ int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
* about to disable them again anyway.
*/
spin_unlock(&tsk->sighand->siglock);
+#ifndef CONFIG_ARCH_USES_KSIGNAL
do_schedule_next_timer(info);
+#else
+ do_schedule_next_timer(&info->q->info);
+#endif
spin_lock(&tsk->sighand->siglock);
}
return signr;
@@ -1836,7 +1866,8 @@ static int sigkill_pending(struct task_struct *tsk)
* If we actually decide not to stop at all because the tracer
* is gone, we keep current->exit_code unless clear_code.
*/
-static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)
+static void ptrace_stop(int exit_code, int why, int clear_code,
+ ksiginfo_t *info)
__releases(&current->sighand->siglock)
__acquires(&current->sighand->siglock)
{
@@ -1960,16 +1991,17 @@ static void ptrace_stop(int exit_code, int why, int clear_code, siginfo_t *info)

static void ptrace_do_notify(int signr, int exit_code, int why)
{
- siginfo_t info;
+ ksiginfo_t info;

- memset(&info, 0, sizeof info);
info.si_signo = signr;
info.si_code = exit_code;
info.si_pid = task_pid_vnr(current);
info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
+ info.q = NULL;

/* Let the debugger run. */
ptrace_stop(exit_code, why, 1, &info);
+ dismiss_siginfo(&info);
}

void ptrace_notify(int exit_code)
@@ -2141,7 +2173,7 @@ static void do_jobctl_trap(void)
}
}

-static int ptrace_signal(int signr, siginfo_t *info)
+static int ptrace_signal(int signr, ksiginfo_t *info)
{
ptrace_signal_deliver();
/*
@@ -2170,8 +2202,11 @@ static int ptrace_signal(int signr, siginfo_t *info)
* have updated *info via PTRACE_SETSIGINFO.
*/
if (signr != info->si_signo) {
+ dismiss_siginfo(info);
info->si_signo = signr;
+#ifndef CONFIG_ARCH_USES_KSIGNAL
info->si_errno = 0;
+#endif
info->si_code = SI_USER;
rcu_read_lock();
info->si_pid = task_pid_vnr(current->parent);
@@ -2182,14 +2217,25 @@ static int ptrace_signal(int signr, siginfo_t *info)

/* If the (new) signal is now blocked, requeue it. */
if (sigismember(&current->blocked, signr)) {
+#ifndef CONFIG_ARCH_USES_KSIGNAL
specific_send_sig_info(signr, info, current);
+#else
+ if (!info->q) {
+ struct siginfo s = {.si_code = info->si_code,
+ .si_signo = signr,
+ .si_pid = info->si_pid,
+ .si_uid = info->si_uid};
+ specific_send_sig_info(signr, &s, current);
+ } else
+ specific_send_sig_info(signr, &info->q->info, current);
+#endif
signr = 0;
}

return signr;
}

-int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka,
+int get_signal_to_deliver(ksiginfo_t *info, struct k_sigaction *return_ka,
struct pt_regs *regs, void *cookie)
{
struct sighand_struct *sighand = current->sighand;
@@ -2363,6 +2409,7 @@ relock:
*/
do_coredump(info);
}
+ dismiss_siginfo(info);

/*
* Death signals, no core dump.
@@ -2377,7 +2424,7 @@ relock:
/**
* signal_delivered -
* @sig: number of signal being delivered
- * @info: siginfo_t of signal being delivered
+ * @info: ksiginfo_t of signal being delivered
* @ka: sigaction setting that chose the handler
* @regs: user register state
* @stepping: nonzero if debugger single-step or block-step in use
@@ -2387,7 +2434,7 @@ relock:
* is always blocked, and the signal itself is blocked unless %SA_NODEFER
* is set in @ka->sa.sa_flags. Tracing is notified.
*/
-void signal_delivered(int sig, siginfo_t *info, struct k_sigaction *ka,
+void signal_delivered(int sig, ksiginfo_t *info, struct k_sigaction *ka,
struct pt_regs *regs, int stepping)
{
sigset_t blocked;
@@ -2412,6 +2459,7 @@ void signal_setup_done(int failed, struct ksignal *ksig, int stepping)
else
signal_delivered(ksig->sig, &ksig->info, &ksig->ka,
signal_pt_regs(), stepping);
+ dismiss_siginfo(&ksig->info);
}

/*
@@ -2721,7 +2769,7 @@ COMPAT_SYSCALL_DEFINE2(rt_sigpending, compat_sigset_t __user *, uset,
}
#endif

-int copy_siginfo_to_user(siginfo_t __user *to, const siginfo_t *from)
+int __copy_siginfo_to_user(siginfo_t __user *to, const siginfo_t *from)
{
int err;

@@ -2804,13 +2852,35 @@ int copy_siginfo_to_user(siginfo_t __user *to, const siginfo_t *from)
return err;
}

+int copy_siginfo_to_user(siginfo_t __user *to, const ksiginfo_t *from)
+{
+ const siginfo_t *info;
+#ifdef CONFIG_ARCH_USES_KSIGNAL
+ if (!from->q) {
+ int err;
+ if (!access_ok (VERIFY_WRITE, to, sizeof(siginfo_t)))
+ return -EFAULT;
+ err = __put_user(from->si_signo, &to->si_signo);
+ err |= __put_user(0, &to->si_errno);
+ err |= __put_user((short) from->si_code, &to->si_code);
+ err |= __put_user(from->si_pid, &to->si_pid);
+ err |= __put_user(from->si_uid, &to->si_uid);
+ return err;
+ }
+ info = &from->q->info;
+#else
+ info = from;
+#endif
+ return __copy_siginfo_to_user(to, info);
+}
+
/**
* do_sigtimedwait - wait for queued signals specified in @which
* @which: queued signals to wait for
* @info: if non-null, the signal's siginfo is returned here
* @ts: upper bound on process time suspension
*/
-int do_sigtimedwait(const sigset_t *which, siginfo_t *info,
+int do_sigtimedwait(const sigset_t *which, ksiginfo_t *info,
const struct timespec *ts)
{
struct task_struct *tsk = current;
@@ -2878,7 +2948,7 @@ SYSCALL_DEFINE4(rt_sigtimedwait, const sigset_t __user *, uthese,
{
sigset_t these;
struct timespec ts;
- siginfo_t info;
+ ksiginfo_t info;
int ret;

/* XXX: Don't preclude handling different sized sigset_t's. */
@@ -2999,6 +3069,73 @@ SYSCALL_DEFINE2(tkill, pid_t, pid, int, sig)
return do_tkill(0, pid, sig);
}

+int copy_ksiginfo_from_user(ksiginfo_t *to, const siginfo_t __user *from)
+{
+#ifndef CONFIG_ARCH_USES_KSIGNAL
+ return copy_from_user(to, from, sizeof(siginfo_t)) ? -EFAULT : 0;
+#else
+ struct sigqueue *q;
+ if (!access_ok(VERIFY_READ, from, sizeof(siginfo_t)))
+ return -EFAULT;
+ if (__get_user(to->si_code, &from->si_code))
+ return -EFAULT;
+ if (to->si_code == SI_USER) {
+ if (__get_user(to->si_signo, &from->si_signo) ||
+ __get_user(to->si_uid, &from->si_uid) ||
+ __get_user(to->si_pid, &from->si_pid))
+ return -EFAULT;
+ to->q = NULL;
+ return 0;
+ }
+ q = __sigqueue_alloc(-1, current, GFP_KERNEL, 0);
+ if (!q)
+ return -ENOMEM;
+ if (copy_from_user(&q->info, from, sizeof(siginfo_t))) {
+ __sigqueue_free(q);
+ return -EFAULT;
+ }
+ to->q = q;
+ to->si_code = q->info.si_code;
+ to->si_signo = q->info.si_signo;
+ return 0;
+#endif
+}
+
+#ifdef CONFIG_COMPAT
+
+int copy_ksiginfo_from_user32(ksiginfo_t *to, compat_siginfo_t __user *from)
+{
+#ifndef CONFIG_ARCH_USES_KSIGNAL
+ return copy_siginfo_from_user32(to, from);
+#else
+ struct sigqueue *q;
+ if (!access_ok(VERIFY_READ, from, sizeof(compat_siginfo_t)))
+ return -EFAULT;
+ if (__get_user(to->si_code, &from->si_code))
+ return -EFAULT;
+ if (to->si_code == SI_USER) {
+ if (__get_user(to->si_signo, &from->si_signo) ||
+ __get_user(to->si_uid, &from->si_uid) ||
+ __get_user(to->si_pid, &from->si_pid))
+ return -EFAULT;
+ to->q = NULL;
+ return 0;
+ }
+ q = __sigqueue_alloc(-1, current, GFP_KERNEL, 0);
+ if (!q)
+ return -ENOMEM;
+ if (copy_siginfo_from_user32(&q->info, from)) {
+ __sigqueue_free(q);
+ return -EFAULT;
+ }
+ to->q = q;
+ to->si_code = q->info.si_code;
+ to->si_signo = q->info.si_signo;
+ return 0;
+#endif
+}
+#endif
+
static int do_rt_sigqueueinfo(pid_t pid, int sig, siginfo_t *info)
{
/* Not even root can pretend to send signals from the kernel.

2014-02-15 14:27:06

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On 02/15, Al Viro wrote:
>
> OK, _very_ preliminary patch follows. It's uglier than it has to

And I'm afraid it needs more uglifications...

> +void dismiss_siginfo(ksiginfo_t *info)
> +{
> + if (info->q)
> + __sigqueue_free(info->q);
> + info->q = NULL;
> +}

1. info->q can be already freed if SIGQUEUE_PREALLOC.

Once get_signal_to_deliver() or any other caller drops ->siglock
another thread can do sys_timer_delete()->sigqueue_free().

2. We need to move do_schedule_next_timer() from dequeue_signal()
here.

Otherwise ->q can be reused/overwritten by the next send_sigqueue()
right affter ->siglock is dropped.

Oleg.

2014-02-15 14:47:01

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On 02/14, Christoph Hellwig wrote:
>
> Might aswell stick the discmiss into what was dequeue_signal_lock().
> Which at that point should get a saner name (maybe thread_dequeue_signal ?)
> and lose all argument except maybe task_struct

No, task_struct argument should die, I think. It is misleading.

spin_lock(tsk->sighand->siglock) is simply wrong unless tsk == current.

And dequeue_signal() assumes that tsk == current too, otherwise
recalc_sigpending() is wrong.

Oleg.

2014-02-15 15:22:59

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote:

> 1. info->q can be already freed if SIGQUEUE_PREALLOC.
>
> Once get_signal_to_deliver() or any other caller drops ->siglock
> another thread can do sys_timer_delete()->sigqueue_free().

How the devil would it find the sucker? It's off the list already.

> 2. We need to move do_schedule_next_timer() from dequeue_signal()
> here.
>
> Otherwise ->q can be reused/overwritten by the next send_sigqueue()
> right affter ->siglock is dropped.

Ditto. We rip them out of queue on collect_signal(); the only thing we do
not do is actual __sigqueue_free().

2014-02-15 15:33:51

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On 02/15, Al Viro wrote:
>
> On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote:
>
> > 1. info->q can be already freed if SIGQUEUE_PREALLOC.
> >
> > Once get_signal_to_deliver() or any other caller drops ->siglock
> > another thread can do sys_timer_delete()->sigqueue_free().
>
> How the devil would it find the sucker?

It simply frees the SIGQUEUE_PREALLOC sigqueue, k_itimer->sigq.

> It's off the list already.

Exactly, list_empty(q->list) == T. So release_posix_timer()->sigqueue_free()
assumes we can safely free it.

> > 2. We need to move do_schedule_next_timer() from dequeue_signal()
> > here.
> >
> > Otherwise ->q can be reused/overwritten by the next send_sigqueue()
> > right affter ->siglock is dropped.
>
> Ditto. We rip them out of queue on collect_signal();

Yes, and dequeue_signal()->do_schedule_next_timer() can trigger another
send_sigqueue() which uses the same SIGQUEUE_PREALLOC sigqueue once we
drop ->siglock.

This is not that bad, but at least ->si_overrun can be overwritten
before __setup_rt_frame()->copy_siginfo_to_user().

Oleg.

2014-02-15 15:36:37

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Sat, Feb 15, 2014 at 03:22:51PM +0000, Al Viro wrote:
> On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote:
>
> > 1. info->q can be already freed if SIGQUEUE_PREALLOC.
> >
> > Once get_signal_to_deliver() or any other caller drops ->siglock
> > another thread can do sys_timer_delete()->sigqueue_free().
>
> How the devil would it find the sucker? It's off the list already.

Ouch... I think I see what you mean. Let me see if I got it right:
timer->sigq is *not* freed by collect_signal(); it's done by
release_posix_timer() instead, under siglock. Frankly, this
/*
* If it is queued it will be freed when dequeued,
* like the "regular" sigqueue.
*/
if (!list_empty(&q->list))
q = NULL;
in sigqueue_free() smells like it's asking for races. Sigh...

2014-02-15 15:58:45

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Sat, Feb 15, 2014 at 03:36:31PM +0000, Al Viro wrote:
> On Sat, Feb 15, 2014 at 03:22:51PM +0000, Al Viro wrote:
> > On Sat, Feb 15, 2014 at 03:27:00PM +0100, Oleg Nesterov wrote:
> >
> > > 1. info->q can be already freed if SIGQUEUE_PREALLOC.
> > >
> > > Once get_signal_to_deliver() or any other caller drops ->siglock
> > > another thread can do sys_timer_delete()->sigqueue_free().
> >
> > How the devil would it find the sucker? It's off the list already.
>
> Ouch... I think I see what you mean. Let me see if I got it right:
> timer->sigq is *not* freed by collect_signal(); it's done by
> release_posix_timer() instead, under siglock. Frankly, this
> /*
> * If it is queued it will be freed when dequeued,
> * like the "regular" sigqueue.
> */
> if (!list_empty(&q->list))
> q = NULL;
> in sigqueue_free() smells like it's asking for races. Sigh...

So basically we want a different condition for "can we just go ahead and
free that sucker", right? Instead of "it's on the list, shan't free it"
it ought to be something like "it's on the list or it is referenced by
ksiginfo". Locking will be interesting, though... ;-/

BTW, I really wonder how does that stuff interact with PTRACE_SETSIGINFO.
What happens if tracer does PTRACE_GETSIGINFO, changes ->si_signo to
something blocked, shoves it back with PTRACE_SETSIGINFO and does
PTRACE_CONT with that new signal number? Would we get two sigqueue instances
with the same ->si_tid, one of them matching the timer->sigq and another
- not?

2014-02-15 16:59:58

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Sat, Feb 15, 2014 at 03:58:38PM +0000, Al Viro wrote:

> BTW, I really wonder how does that stuff interact with PTRACE_SETSIGINFO.
> What happens if tracer does PTRACE_GETSIGINFO, changes ->si_signo to
> something blocked, shoves it back with PTRACE_SETSIGINFO and does
> PTRACE_CONT with that new signal number? Would we get two sigqueue instances
> with the same ->si_tid, one of them matching the timer->sigq and another
> - not?

I wonder if it would be simpler to use the pointer *only* for si_code < 0
case. It still gives us ksiginfo_t much smaller than siginfo_t, avoids
all the mess with timers and AFAICS results in less intrusive patch.
IOW, the split between "we know what that sucker is" and "completely
opaque shit the userland has given us".

I'll try to put together something along those lines and see how well does
that work...

2014-02-15 17:44:00

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On 02/15, Al Viro wrote:
>
> > Ouch... I think I see what you mean. Let me see if I got it right:
> > timer->sigq is *not* freed by collect_signal(); it's done by
> > release_posix_timer() instead, under siglock. Frankly, this
> > /*
> > * If it is queued it will be freed when dequeued,
> > * like the "regular" sigqueue.
> > */
> > if (!list_empty(&q->list))
> > q = NULL;
> > in sigqueue_free() smells like it's asking for races. Sigh...

This is protected by ->siglock, should be safe...

> So basically we want a different condition for "can we just go ahead and
> free that sucker", right? Instead of "it's on the list, shan't free it"
> it ought to be something like "it's on the list or it is referenced by
> ksiginfo". Locking will be interesting, though... ;-/

I guess yes... send_sigqueue() checks list_empty() too, probably nobody else.


> BTW, I really wonder how does that stuff interact with PTRACE_SETSIGINFO.
> What happens if tracer does PTRACE_GETSIGINFO, changes ->si_signo to
> something blocked, shoves it back with PTRACE_SETSIGINFO and does
> PTRACE_CONT with that new signal number? Would we get two sigqueue instances
> with the same ->si_tid, one of them matching the timer->sigq and another
> - not?

Or the task sends a SI_TIMER info to itself via sys_rt_sigqueueinfo().

Afaics, nothing really bad can happen, I mean the kernel should not
crash or something like this. do_schedule_next_timer() can be fooled,
but at least lock_timer() can only succeed if this process actually
has a timer with the same timer_id. This sigqueue != timer->sigq, but
I think this doesn't matter, posix_timer_event() will use timer->sigq
anyway.

Oleg.

2014-02-15 18:05:27

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Sat, Feb 15, 2014 at 06:43:45PM +0100, Oleg Nesterov wrote:
> > So basically we want a different condition for "can we just go ahead and
> > free that sucker", right? Instead of "it's on the list, shan't free it"
> > it ought to be something like "it's on the list or it is referenced by
> > ksiginfo". Locking will be interesting, though... ;-/
>
> I guess yes... send_sigqueue() checks list_empty() too, probably nobody else.

The trouble being, we might end up with
Q picked by collect_signal and and stuff into ksiginfo
Q resubmitted by timer code
Q picked by *another* thread into its ksiginfo
the first thread finally done with signal
and at that point we still have a reference in the second thread's ksiginfo.
Hell knows - my first reflex in that kind of situation is to replace
that flag with refcount, so that timer code would hold a reference from
timer_create(2) to timer_delete(2), send_sigqueue() would bump it and
dismiss_siginfo() - drop the sucker. But that means either grabbing
siglock in dismiss_siginfo() or making the counter atomic; either way
it's a cacheline ping-pong. Atomic counter is less painful in that respect -
it would be right next to the list, so we dirty that cacheline anyway...

I'm still trying another approach (slightly bigger ksiginfo used to store
all variants with si_code >= 0), but it has messiness of its own; we'll
see how it goes...

2014-02-15 18:45:38

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On 02/15, Al Viro wrote:
>
> On Sat, Feb 15, 2014 at 06:43:45PM +0100, Oleg Nesterov wrote:
> > > So basically we want a different condition for "can we just go ahead and
> > > free that sucker", right? Instead of "it's on the list, shan't free it"
> > > it ought to be something like "it's on the list or it is referenced by
> > > ksiginfo". Locking will be interesting, though... ;-/
> >
> > I guess yes... send_sigqueue() checks list_empty() too, probably nobody else.
>
> The trouble being, we might end up with
> Q picked by collect_signal and and stuff into ksiginfo
> Q resubmitted by timer code

In this case the timer code should simply inc ->si_overrun and do nothing.

IOW, list_empty() should be turned into is_queued(), and is_queued()
should be true until dismiss_siginfo() which should also do
do_schedule_next_timer(). I think.

Oleg.

2014-02-15 22:24:06

by Dave Chinner

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Fri, Feb 14, 2014 at 11:01:23AM -0500, Dave Jones wrote:
> On Fri, Feb 14, 2014 at 11:24:27AM +1100, Dave Chinner wrote:
>
> > > I can fix this one easily - we already have a workqueue for doing
> > > async log pushes (will split the stack between xlog_cil_force_lsn
> > > and xlog_cil_push), but the reason we haven't used it for synchronous
> > > log forces is that screws up fsync performance on CFQ. We don't
> > > recommend CFQ with XFS anyway, so I think I'll make this change
> > > anyway.
> >
> > Dave, the patch below should chop off the stack usage from
> > xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
> > Can you given this a run?
>
> Looks like it's survived an overnight run..

Great.

One thing that has been puzzling me is why I'm seeing stack usage
reported that is way above what we declare locally on the stack.
e.g.

29) 2048 224 xfs_da_grow_inode_int+0xbb/0x340
30) 1824 96 xfs_dir2_grow_inode+0x63/0x110
31) 1728 208 xfs_dir2_sf_to_block+0xe7/0x5e0
32) 1520 144 xfs_dir2_sf_addname+0x115/0x5c0
33) 1376 96 xfs_dir_createname+0x164/0x1a0
34) 1280 224 xfs_create+0x536/0x660
35) 1056 128 xfs_vn_mknod+0xc8/0x1d0

I pulled 128 bytes out of xfs_dir_createname by allocating the
xfs_da_args structure, but I still couldn't reconcile the amount of
stack use being reported with the amount used by locally declared
variables (even considering in-lined leaf functions). For example:

Locally
Declared Used function
88 224 xfs_da_grow_inode_int
32 96 xfs_dir2_grow_inode
168 208 xfs_dir2_sf_to_block
56 144 xfs_dir2_sf_addname
16 96 xfs_dir_createname
120 224 xfs_create
52 128 xfs_vn_mknod

There's a pretty massive difference between the actual stack usage
of the local variables and the amount of stack being used by the
compiled code.

What it appears to be is that the compiler is pushing 6-10 registers
to the stack on every function call. So a function that only has 3
local variables and does very little but allocate a structure and
call other functions saves an 6 registers to the stack before it
starts:

Dump of assembler code for function xfs_dir_createname:
214 {
0xffffffff814d7380 <+0>: callq 0xffffffff81cf0940
0xffffffff814d7385 <+5>: push %rbp
0xffffffff814d7386 <+6>: mov %rsp,%rbp
0xffffffff814d7389 <+9>: sub $0x50,%rsp
0xffffffff814d738d <+13>: mov %rbx,-0x28(%rbp)
0xffffffff814d7391 <+17>: mov %rdi,%rbx
0xffffffff814d7394 <+20>: mov %r12,-0x20(%rbp)
0xffffffff814d7398 <+24>: mov %rcx,%r12
0xffffffff814d739b <+27>: mov %r13,-0x18(%rbp)
0xffffffff814d739f <+31>: mov %rsi,%r13
0xffffffff814d73a5 <+37>: mov %r14,-0x10(%rbp)
0xffffffff814d73a9 <+41>: mov %rdx,%r14
0xffffffff814d73ac <+44>: mov %r15,-0x8(%rbp)
0xffffffff814d73b0 <+48>: mov %r8,%r15
0xffffffff814d73b7 <+55>: mov %r9,-0x48(%rbp)
.....

If this is typical across the call chain (appears to be from my
quick survey) then we are averaging 40-50 bytes of stack per
call for saved registers. That's a lot of stack space we can't
directly control the usage of, especially when we are talking about
call chains that can get 50+ functions deep...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-02-15 22:28:58

by Dave Jones

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Sun, Feb 16, 2014 at 09:23:56AM +1100, Dave Chinner wrote:

> There's a pretty massive difference between the actual stack usage
> of the local variables and the amount of stack being used by the
> compiled code.
>
> What it appears to be is that the compiler is pushing 6-10 registers
> to the stack on every function call. So a function that only has 3
> local variables and does very little but allocate a structure and
> call other functions saves an 6 registers to the stack before it
> starts:

I've got a shitload of debug options enabled, which may explain it.
Or perhaps that new STACK_PROTECTOR_STRONG stuff ?

Dave

2014-02-15 22:43:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Sat, Feb 15, 2014 at 2:28 PM, Dave Jones <[email protected]> wrote:
>
> I've got a shitload of debug options enabled, which may explain it.
> Or perhaps that new STACK_PROTECTOR_STRONG stuff ?

Well, a lot of it is just the callee-saved registers. The compiler
will tend to preferentially allocate registers in the callee-trashed
registers, but if the function isn't a leaf function, any registers
that are live around a function call will have to be saved somewhere -
either explicitly around the function call, or - more likely - in
callee-saved registers that then get saved in the prologue/epilogue of
the function. And this will happen even in leaf functions when there
is enough register pressure that the callee-trashed registers aren't
sufficient (which is pretty common).

So saving 5-6 registers on the stack (in addition to any actual stack
frame) is pretty much the norm for anything but the very simplest
cases.

But yeah, I'm sure some config options make it worse.
STACK_PROTECTOR_STRONG could easily be one of those.

Linus

2014-02-15 23:50:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

[ Added linux-mm to the participants list ]

On Thu, Feb 13, 2014 at 4:24 PM, Dave Chinner <[email protected]> wrote:
>
> Dave, the patch below should chop off the stack usage from
> xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
> Can you given this a run?

Ok, so DaveJ confirmed that DaveC's patch fixes his issue (damn,
people, your parents were some seriously boring people, were they not?
We've got too many Dave's around), but DaveC earlier pointed out that
pretty much any memory allocation path can end up using 3kB of stack
even without XFS being involved.

Which does bring up the question whether we should look (once more) at
the VM direct-reclaim path, and try to avoid GFP_FS/IO direct
reclaim..

Direct reclaim historically used to be an important throttling
mechanism, and I used to not be a fan of trying to avoid direct
reclaim. But the stack depth issue really looks to be pretty bad, and
I think we've gotten better at throttling explicitly, so..

I *think* we already limit filesystem writeback to just kswapd (in
shrink_page_list()), but DaveC posted a backtrace that goes through
do_try_to_free_pages() to shrink_slab(), and through there to the
filesystem and then IO. That looked like a disaster.

And that's because (if I read things right) shrink_page_list() limits
filesystem page writeback to kswapd, but not swap pages. Which I think
probably made more sense back in the days than it does now (I
certainly *hope* that swapping is less important today than it was,
say, ten years ago)

So I'm wondering whether we should remove that page_is_file_cache()
check from shrink_page_list()?

And then there is that whole shrink_slab() case...

Linus

2014-02-17 16:57:46

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On 02/15, Oleg Nesterov wrote:
>
> On 02/15, Al Viro wrote:
> >
> > On Sat, Feb 15, 2014 at 06:43:45PM +0100, Oleg Nesterov wrote:
> > > > So basically we want a different condition for "can we just go ahead and
> > > > free that sucker", right? Instead of "it's on the list, shan't free it"
> > > > it ought to be something like "it's on the list or it is referenced by
> > > > ksiginfo". Locking will be interesting, though... ;-/
> > >
> > > I guess yes... send_sigqueue() checks list_empty() too, probably nobody else.
> >
> > The trouble being, we might end up with
> > Q picked by collect_signal and and stuff into ksiginfo
> > Q resubmitted by timer code
>
> In this case the timer code should simply inc ->si_overrun and do nothing.
>
> IOW, list_empty() should be turned into is_queued(), and is_queued()
> should be true until dismiss_siginfo() which should also do
> do_schedule_next_timer(). I think.

No, this is even more complicated. We also need do_schedule_next_timer()
to calculate ->si_overrun we are going to report, I missed this...

Looks like, this is all is really nasty. Actually, I think siginfo on
stack is not that bad if we are going to do handle_signal() or restart,
perhaps we can do the extra kmalloc/memcpy/kfree for do_coredump().
Something like below.

Oleg.


diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 9e5de68..52f16f9 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -684,6 +684,52 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
test_thread_flag(TIF_IA32) ? __NR_ia32_restart_syscall : __NR_restart_syscall
#endif /* CONFIG_X86_32 */

+static noinline int process_signal(struct pt_regs *regs, siginfo_t **pinfo)
+{
+ struct ksignal ksig;
+
+ switch (get_signal(&ksig)) {
+ case SIG_DUMP:
+ *pinfo = kmalloc(sizeof(siginfo_t), GFP_KERNEL);
+ if (*pinfo)
+ copy_siginfo(*pinfo, &ksig.info);
+
+ case SIG_EXIT:
+ return ksig.info.si_signo;
+
+ default:
+ handle_signal(&ksig, regs);
+ break;
+
+ case 0:
+ /* Did we come from a system call? */
+ if (syscall_get_nr(current, regs) >= 0) {
+ /* Restart the system call - no handlers present */
+ switch (syscall_get_error(current, regs)) {
+ case -ERESTARTNOHAND:
+ case -ERESTARTSYS:
+ case -ERESTARTNOINTR:
+ regs->ax = regs->orig_ax;
+ regs->ip -= 2;
+ break;
+
+ case -ERESTART_RESTARTBLOCK:
+ regs->ax = NR_restart_syscall;
+ regs->ip -= 2;
+ break;
+ }
+ }
+
+ /*
+ * If there's no signal to deliver, we just put the saved sigmask
+ * back.
+ */
+ restore_saved_sigmask();
+ }
+
+ return 0;
+}
+
/*
* Note that 'init' is a special process: it doesn't get signals it doesn't
* want to handle. Thus you cannot kill init even with a SIGKILL even by
@@ -691,37 +737,16 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
*/
static void do_signal(struct pt_regs *regs)
{
- struct ksignal ksig;
+ siginfo_t *info = NULL;
+ int sig = process_signal(regs, &info);

- if (get_signal(&ksig)) {
- /* Whee! Actually deliver the signal. */
- handle_signal(&ksig, regs);
- return;
- }
-
- /* Did we come from a system call? */
- if (syscall_get_nr(current, regs) >= 0) {
- /* Restart the system call - no handlers present */
- switch (syscall_get_error(current, regs)) {
- case -ERESTARTNOHAND:
- case -ERESTARTSYS:
- case -ERESTARTNOINTR:
- regs->ax = regs->orig_ax;
- regs->ip -= 2;
- break;
-
- case -ERESTART_RESTARTBLOCK:
- regs->ax = NR_restart_syscall;
- regs->ip -= 2;
- break;
+ if (sig) {
+ if (info) {
+ do_coredump(info);
+ kfree(info);
}
+ do_group_exit(sig);
}
-
- /*
- * If there's no signal to deliver, we just put the saved sigmask
- * back.
- */
- restore_saved_sigmask();
}

/*
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 2ac423b..33b5e04 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -285,6 +285,9 @@ struct ksignal {
int sig;
};

+#define SIG_EXIT -1
+#define SIG_DUMP -2
+
extern int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka, struct pt_regs *regs, void *cookie);
extern void signal_setup_done(int failed, struct ksignal *ksig, int stepping);
extern void signal_delivered(int sig, siginfo_t *info, struct k_sigaction *ka, struct pt_regs *regs, int stepping);
@@ -299,7 +302,7 @@ extern void exit_signals(struct task_struct *tsk);
struct ksignal *p = (ksig); \
p->sig = get_signal_to_deliver(&p->info, &p->ka, \
signal_pt_regs(), NULL);\
- p->sig > 0; \
+ p->sig; \
})

extern struct kmem_cache *sighand_cachep;
diff --git a/kernel/signal.c b/kernel/signal.c
index 52f881d..8a4c4a3 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2353,22 +2353,11 @@ relock:
if (print_fatal_signals)
print_fatal_signal(info->si_signo);
proc_coredump_connector(current);
- /*
- * If it was able to dump core, this kills all
- * other threads in the group and synchronizes with
- * their demise. If we lost the race with another
- * thread getting here, it set group_exit_code
- * first and our do_group_exit call below will use
- * that value and ignore the one we pass it.
- */
- do_coredump(info);
+ return SIG_DUMP;
}

- /*
- * Death signals, no core dump.
- */
- do_group_exit(info->si_signo);
- /* NOTREACHED */
+ return SIG_EXIT;
+
}
spin_unlock_irq(&sighand->siglock);
return signr;

2014-02-17 17:41:08

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Mon, Feb 17, 2014 at 05:57:35PM +0100, Oleg Nesterov wrote:

> Looks like, this is all is really nasty. Actually, I think siginfo on
> stack is not that bad if we are going to do handle_signal() or restart,
> perhaps we can do the extra kmalloc/memcpy/kfree for do_coredump().
> Something like below.

Yecchhhh... You've just broken every architecture other than x86, and to
fix them you'll need to massage every get_signal()/get_signal_to_deliver()
user out there, pulling the logics *out* of kernel/signal.c and into arch/*.
This is just plain wrong.

2014-02-17 17:46:55

by Oleg Nesterov

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On 02/17, Al Viro wrote:
>
> On Mon, Feb 17, 2014 at 05:57:35PM +0100, Oleg Nesterov wrote:
>
> > Looks like, this is all is really nasty. Actually, I think siginfo on
> > stack is not that bad if we are going to do handle_signal() or restart,
> > perhaps we can do the extra kmalloc/memcpy/kfree for do_coredump().
> > Something like below.
>
> Yecchhhh... You've just broken every architecture other than x86,

Of course, this is only to explain what I meant.

> and to
> fix them you'll need to massage every get_signal()/get_signal_to_deliver()
> user out there,

Yes.

> pulling the logics *out* of kernel/signal.c and into arch/*.

Not really, I think. Of course this change should be cleanuped. And it
should not require to change all architectures at once.

> This is just plain wrong.

I agree, this change is also ugly.

Oleg.

2014-02-17 17:54:12

by Al Viro

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Mon, Feb 17, 2014 at 06:46:48PM +0100, Oleg Nesterov wrote:
> > pulling the logics *out* of kernel/signal.c and into arch/*.
>
> Not really, I think.

How so? You propose to make all architectures call do_coredump()
themselves, instead of having it done centrally. It's more boilerplate
for all of them to get wrong. And IME any boilerplate in that are
*will* be gotten wrong - I've gone through signal handling on all
architectures more than a few times and every time it caught a pile
of bugs...

2014-02-18 01:28:39

by Dave Chinner

[permalink] [raw]
Subject: Re: 3.14-rc2 XFS backtrace because irqs_disabled.

On Sat, Feb 15, 2014 at 03:50:29PM -0800, Linus Torvalds wrote:
> [ Added linux-mm to the participants list ]
>
> On Thu, Feb 13, 2014 at 4:24 PM, Dave Chinner <[email protected]> wrote:
> >
> > Dave, the patch below should chop off the stack usage from
> > xfs_log_force_lsn() issuing IO by deferring it to the CIL workqueue.
> > Can you given this a run?
>
> Ok, so DaveJ confirmed that DaveC's patch fixes his issue (damn,
> people, your parents were some seriously boring people, were they not?
> We've got too many Dave's around),

It's an exclusive club - we have 'kernel hacker Dave' reunions in
bars around the world. We should get some tshirts made up.... :)

> but DaveC earlier pointed out that
> pretty much any memory allocation path can end up using 3kB of stack
> even without XFS being involved.
>
> Which does bring up the question whether we should look (once more) at
> the VM direct-reclaim path, and try to avoid GFP_FS/IO direct
> reclaim..

We do that mostly already, but GFP_KERNEL allows swap IO and that's
where the deepest stack I saw came from.

Even if we don't allow IO at all, we're still going to see stack
usage of 2-2.5k in direct reclaim. e.g. invalidate a page and enter
the rmap code. The rmap is protected by a mutex, so if we fail to
get that we have about 1.2k of stack consumed from there and that is
on top of the the allocator/reclaim that has already consumes ~1k of
stack...

> Direct reclaim historically used to be an important throttling
> mechanism, and I used to not be a fan of trying to avoid direct
> reclaim. But the stack depth issue really looks to be pretty bad, and
> I think we've gotten better at throttling explicitly, so..
>
> I *think* we already limit filesystem writeback to just kswapd (in
> shrink_page_list()), but DaveC posted a backtrace that goes through
> do_try_to_free_pages() to shrink_slab(), and through there to the
> filesystem and then IO. That looked like a disaster.

Right, that's an XFS problem, and I'm working on fixing it. The
Patch I sent to DaveJ fixes the worst case, but I need to make it
completely IO-less while still retaining the throttling the IO gives
us.

> And that's because (if I read things right) shrink_page_list() limits
> filesystem page writeback to kswapd, but not swap pages. Which I think
> probably made more sense back in the days than it does now (I
> certainly *hope* that swapping is less important today than it was,
> say, ten years ago)
>
> So I'm wondering whether we should remove that page_is_file_cache()
> check from shrink_page_list()?

The thing is, the stack usage from the swap IO path is pretty well
bound - it's just the worst case stack of issuing IO. We know it
won't recurse into direct reclaim, so mempool allocation and
blocking is all we need to consider. Compare that to a filesystem
which may need to allocate extents and hence do transactions and
split btrees and read metadata and allocate large amounts of memory
even before it gets to the IO layers.

Hence I suspect that we could do a simple thing like only allow swap
if there's more than half the stack available in the current reclaim
context. Because, let's face it, if the submit_bio path is consuming
more than half the available stack then we're totally screwed from a
filesystem perspective....

> And then there is that whole shrink_slab() case...

I think with shrinkers we just need to be more careful. The XFS
behaviour is all my fault, and I should have known better that to
design code that requires IO in the direct reclaim path. :/

Cheers,

Dave.
--
Dave Chinner
[email protected]