2007-12-28 17:41:48

by Gabor Gombas

[permalink] [raw]
Subject: Oops involving RFCOMM and sysfs

Hi,

I'm using my phone as a GPRS modem over Bluetooth. Sometimes Bluetooth
on the phone seems to stall and the phone has to be switched off & on to
get it back to a sane state. During this I sometimes get the following
Oops (this is kernel 2.6.24-rc6 on x86_64):

Dec 28 17:52:11 host kernel: Unable to handle kernel NULL pointer dereference at 00000000000000b8 RIP:
Dec 28 17:52:11 host kernel: [<ffffffff8046e0d5>] mutex_lock+0x10/0x1d
Dec 28 17:52:11 host kernel: PGD 920b0067 PUD a7aba067 PMD 0
Dec 28 17:52:11 host kernel: Oops: 0002 [1]
Dec 28 17:52:11 host kernel: CPU 0
Dec 28 17:52:11 host kernel: Modules linked in: cpufreq_ondemand ppp_deflate zlib_deflate zlib_inflate bsd_comp ppp_async crc_ccitt ppp_generic slhc binfmt_misc rfcomm l2cap nfsd auth_rpcgss exportfs ipt_REJECT xt_tcpudp ipt_LOG xt_limit iptable_filter ip_tables x_tables nfs lockd nfs_acl sunrpc fuse dm_crypt dm_snapshot dm_mirror saa7134_alsa radeon hwmon_vid eeprom hci_usb bluetooth usb_storage tuner tea5767 tda8290 tuner_simple mt20xx tea5761 sg snd_intel8x0 saa7134 firewire_ohci snd_ac97_codec firewire_core videobuf_dma_sg videobuf_core ir_kbd_i2c ac97_bus crc_itu_t ir_common snd_pcm ehci_hcd ohci_hcd parport_pc parport sky2 sr_mod forcedeth snd_timer snd_page_alloc cdrom floppy
Dec 28 17:52:11 host kernel: Pid: 4381, comm: pppd Not tainted 2.6.24-rc6 #1
Dec 28 17:52:11 host kernel: RIP: 0010:[<ffffffff8046e0d5>] [<ffffffff8046e0d5>] mutex_lock+0x10/0x1d
Dec 28 17:52:11 host kernel: RSP: 0018:ffff81009402fd08 EFLAGS: 00010246
Dec 28 17:52:11 host kernel: RAX: 0000000000000000 RBX: 00000000000000b8 RCX: 0000000000000000
Dec 28 17:52:11 host kernel: RDX: ffff81009402ffd8 RSI: ffffffff80639d70 RDI: 00000000000000b8
Dec 28 17:52:11 host kernel: RBP: ffffffff8058adb0 R08: 6390dfaa7e3bd395 R09: ffffffff8050597e
Dec 28 17:52:11 host kernel: R10: 0000000000004000 R11: 0000003000000018 R12: ffff8100b4d75e60
Dec 28 17:52:11 host kernel: R13: 00000000fffffff4 R14: ffff8100abe9e180 R15: ffff8100a7a10900
Dec 28 17:52:11 host kernel: FS: 00002ab1bddc3ac0(0000) GS:ffffffff805b0000(0000) knlGS:0000000000000000
Dec 28 17:52:11 host kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 28 17:52:11 host kernel: CR2: 00000000000000b8 CR3: 000000009400e000 CR4: 00000000000006e0
Dec 28 17:52:11 host kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 28 17:52:11 host kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 28 17:52:11 host kernel: Process pppd (pid: 4381, threadinfo ffff81009402e000, task ffff81009403d060)
Dec 28 17:52:11 host kernel: Stack: ffff81009403d060 ffffffff80285555 ffff810093734d48 ffffffff802aefa5
Dec 28 17:52:11 host kernel: ffff8100bf9e1c80 ffff8100b4d75e60 ffff810060720600 ffffffff802af1ba
Dec 28 17:52:11 host kernel: ffff8100b6e5ced0 ffff8100b6e5ced0 ffff8100bfa11c00 ffff810060720600
Dec 28 17:52:11 host kernel: Call Trace:
Dec 28 17:52:11 host kernel: [<ffffffff80285555>] dput+0x26/0x103
Dec 28 17:52:11 host kernel: [<ffffffff802aefa5>] sysfs_get_dentry+0x45/0x8f
Dec 28 17:52:11 host kernel: [<ffffffff802af1ba>] sysfs_move_dir+0x63/0x204
Dec 28 17:52:11 host kernel: [<ffffffff803006e5>] kobject_move+0xba/0x110
Dec 28 17:52:11 host kernel: [<ffffffff80368640>] device_move+0x59/0x111
Dec 28 17:52:11 host kernel: [<ffffffff8828841c>] :rfcomm:rfcomm_tty_close+0x2f/0x74
Dec 28 17:52:11 host kernel: [<ffffffff803442ff>] release_dev+0x212/0x5e2
Dec 28 17:52:11 host kernel: [<ffffffff80288801>] mntput_no_expire+0x1f/0x6f
Dec 28 17:52:11 host kernel: [<ffffffff803e845d>] sys_sendto+0x128/0x151
Dec 28 17:52:11 host kernel: [<ffffffff803446db>] tty_release+0xc/0x10
Dec 28 17:52:11 host kernel: [<ffffffff80276f67>] __fput+0xb1/0x16f
Dec 28 17:52:11 host kernel: [<ffffffff802748b5>] filp_close+0x5d/0x65
Dec 28 17:52:11 host kernel: [<ffffffff80275a07>] sys_close+0x73/0xa6
Dec 28 17:52:11 host kernel: [<ffffffff8020b5fc>] tracesys+0xdc/0xe1
Dec 28 17:52:11 host kernel:
Dec 28 17:52:11 host kernel:
Dec 28 17:52:11 host kernel: Code: ff 0f 79 05 e8 c9 00 00 00 58 5a 5b c3 41 54 48 8d 47 08 48
Dec 28 17:52:11 host kernel: RIP [<ffffffff8046e0d5>] mutex_lock+0x10/0x1d
Dec 28 17:52:11 host kernel: RSP <ffff81009402fd08>
Dec 28 17:52:11 host kernel: CR2: 00000000000000b8
Dec 28 17:52:11 host kernel: ---[ end trace 12717319afdb56f5 ]---

The bug is also present in 2.6.23.9:

Dec 28 15:30:29 twister kernel: Unable to handle kernel paging request at fffffffffffffffe RIP:
Dec 28 15:30:29 twister kernel: [<ffffffff8027ac13>] dput+0xd/0x103
Dec 28 15:30:29 twister kernel: PGD 203067 PUD 204067 PMD 0
Dec 28 15:30:29 twister kernel: Oops: 0000 [1]
Dec 28 15:30:29 twister kernel: CPU 0
Dec 28 15:30:29 twister kernel: Modules linked in: ppp_deflate zlib_deflate zlib_inflate bsd_comp ppp_async crc_ccitt ppp_generic slhc binfmt_misc rfcomm l2cap nfsd exportfs auth_rpcgss ipt_REJECT xt_tcpudp ipt_LOG xt_limit iptable_filter ip_tables x_tables nfs lockd nfs_acl sunrpc fuse dm_crypt dm_snapshot dm_mirror saa7134_alsa radeon it87 hwmon_vid eeprom hci_usb bluetooth tuner usb_storage saa7134 video_buf ir_kbd_i2c snd_intel8x0 sg firewire_ohci firewire_core ir_common crc_itu_t snd_ac97_codec ac97_bus forcedeth parport_pc parport sky2 ohci_hcd snd_pcm snd_timer snd_page_alloc ehci_hcd sr_mod cdrom floppy
Dec 28 15:30:29 twister kernel: Pid: 14700, comm: pppd Not tainted 2.6.23.9 #1
Dec 28 15:30:29 twister kernel: RIP: 0010:[<ffffffff8027ac13>] [<ffffffff8027ac13>] dput+0xd/0x103
Dec 28 15:30:29 twister kernel: RSP: 0018:ffff810054e5bd28 EFLAGS: 00010282
Dec 28 15:30:29 twister kernel: RAX: fffffffffffffffe RBX: fffffffffffffffe RCX: 0000000000000052
Dec 28 15:30:29 twister kernel: RDX: ffffffff802f6414 RSI: ffff81000411b4e0 RDI: fffffffffffffffe
Dec 28 15:30:29 twister kernel: RBP: ffff8100bfef14d0 R08: ffffffffffffffff R09: 0000000000005401
Dec 28 15:30:29 twister kernel: R10: 00007fff7018a2e0 R11: 0000000000000246 R12: ffff8100b3731948
Dec 28 15:30:29 twister kernel: R13: 0000000000000000 R14: ffff81005a1cb100 R15: fffffffffffffffe
Dec 28 15:30:29 twister kernel: FS: 00002acd3a94cac0(0000) GS:ffffffff80592000(0000) knlGS:00000000f7d786b0
Dec 28 15:30:29 twister kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 28 15:30:29 twister kernel: CR2: fffffffffffffffe CR3: 0000000047f21000 CR4: 00000000000006e0
Dec 28 15:30:29 twister kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 28 15:30:29 twister kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 28 15:30:29 twister kernel: Process pppd (pid: 14700, threadinfo ffff810054e5a000, task ffff81007907ce40)
Dec 28 15:30:29 twister kernel: Stack: 00000000fffffffe ffffffff802a5a81 0000000000000053 ffff81000411b4e0
Dec 28 15:30:29 twister kernel: 0000000000000000 ffff8100b539f8d0 ffff81000411b4e0 ffff8100b04e7060
Dec 28 15:30:29 twister kernel: 00000000fffffff4 ffff81005a1cb100 ffff8100b09db800 ffffffff802f6448
Dec 28 15:30:29 twister kernel: Call Trace:
Dec 28 15:30:29 twister kernel: [<ffffffff802a5a81>] sysfs_move_dir+0x1d2/0x1ed
Dec 28 15:30:29 twister kernel: [<ffffffff802f6448>] kobject_move+0xba/0x110
Dec 28 15:30:29 twister kernel: [<ffffffff8035e0f0>] device_move+0x59/0x111
Dec 28 15:30:29 twister kernel: [<ffffffff8827c382>] :rfcomm:rfcomm_tty_close+0x2f/0x74
Dec 28 15:30:29 twister kernel: [<ffffffff803397ff>] release_dev+0x212/0x5e2
Dec 28 15:30:29 twister kernel: [<ffffffff802d4206>] inode_has_perm+0x65/0x72
Dec 28 15:30:29 twister kernel: [<ffffffff802d42a7>] file_has_perm+0x94/0xa3
Dec 28 15:30:29 twister kernel: [<ffffffff80339bdb>] tty_release+0xc/0x11
Dec 28 15:30:29 twister kernel: [<ffffffff8026c578>] __fput+0xb1/0x177
Dec 28 15:30:29 twister kernel: [<ffffffff80269ecc>] filp_close+0x5d/0x65
Dec 28 15:30:29 twister kernel: [<ffffffff8026b01a>] sys_close+0x72/0xa5
Dec 28 15:30:29 twister kernel: [<ffffffff8020b5cc>] tracesys+0xdc/0xe1
Dec 28 15:30:29 twister kernel:
Dec 28 15:30:29 twister kernel:
Dec 28 15:30:29 twister kernel: Code: 83 3b 01 75 05 e8 e8 29 1e 00 48 c7 c6 08 de 61 80 48 89 df
Dec 28 15:30:29 twister kernel: RIP [<ffffffff8027ac13>] dput+0xd/0x103
Dec 28 15:30:29 twister kernel: RSP <ffff810054e5bd28>
Dec 28 15:30:29 twister kernel: CR2: fffffffffffffffe

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------


2007-12-29 08:07:19

by Dave Young

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Dec 29, 2007 1:32 AM, Gabor Gombas <[email protected]> wrote:
> Hi,
>
> I'm using my phone as a GPRS modem over Bluetooth. Sometimes Bluetooth
> on the phone seems to stall and the phone has to be switched off & on to
> get it back to a sane state. During this I sometimes get the following
> Oops (this is kernel 2.6.24-rc6 on x86_64):
>
> Dec 28 17:52:11 host kernel: Unable to handle kernel NULL pointer dereference at 00000000000000b8 RIP:
> Dec 28 17:52:11 host kernel: [<ffffffff8046e0d5>] mutex_lock+0x10/0x1d
> Dec 28 17:52:11 host kernel: PGD 920b0067 PUD a7aba067 PMD 0
> Dec 28 17:52:11 host kernel: Oops: 0002 [1]
> Dec 28 17:52:11 host kernel: CPU 0
> Dec 28 17:52:11 host kernel: Modules linked in: cpufreq_ondemand ppp_deflate zlib_deflate zlib_inflate bsd_comp ppp_async crc_ccitt ppp_generic slhc binfmt_misc rfcomm l2cap nfsd auth_rpcgss exportfs ipt_REJECT xt_tcpudp ipt_LOG xt_limit iptable_filter ip_tables x_tables nfs lockd nfs_acl sunrpc fuse dm_crypt dm_snapshot dm_mirror saa7134_alsa radeon hwmon_vid eeprom hci_usb bluetooth usb_storage tuner tea5767 tda8290 tuner_simple mt20xx tea5761 sg snd_intel8x0 saa7134 firewire_ohci snd_ac97_codec firewire_core videobuf_dma_sg videobuf_core ir_kbd_i2c ac97_bus crc_itu_t ir_common snd_pcm ehci_hcd ohci_hcd parport_pc parport sky2 sr_mod forcedeth snd_timer snd_page_alloc cdrom floppy
> Dec 28 17:52:11 host kernel: Pid: 4381, comm: pppd Not tainted 2.6.24-rc6 #1
> Dec 28 17:52:11 host kernel: RIP: 0010:[<ffffffff8046e0d5>] [<ffffffff8046e0d5>] mutex_lock+0x10/0x1d
> Dec 28 17:52:11 host kernel: RSP: 0018:ffff81009402fd08 EFLAGS: 00010246
> Dec 28 17:52:11 host kernel: RAX: 0000000000000000 RBX: 00000000000000b8 RCX: 0000000000000000
> Dec 28 17:52:11 host kernel: RDX: ffff81009402ffd8 RSI: ffffffff80639d70 RDI: 00000000000000b8
> Dec 28 17:52:11 host kernel: RBP: ffffffff8058adb0 R08: 6390dfaa7e3bd395 R09: ffffffff8050597e
> Dec 28 17:52:11 host kernel: R10: 0000000000004000 R11: 0000003000000018 R12: ffff8100b4d75e60
> Dec 28 17:52:11 host kernel: R13: 00000000fffffff4 R14: ffff8100abe9e180 R15: ffff8100a7a10900
> Dec 28 17:52:11 host kernel: FS: 00002ab1bddc3ac0(0000) GS:ffffffff805b0000(0000) knlGS:0000000000000000
> Dec 28 17:52:11 host kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Dec 28 17:52:11 host kernel: CR2: 00000000000000b8 CR3: 000000009400e000 CR4: 00000000000006e0
> Dec 28 17:52:11 host kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Dec 28 17:52:11 host kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Dec 28 17:52:11 host kernel: Process pppd (pid: 4381, threadinfo ffff81009402e000, task ffff81009403d060)
> Dec 28 17:52:11 host kernel: Stack: ffff81009403d060 ffffffff80285555 ffff810093734d48 ffffffff802aefa5
> Dec 28 17:52:11 host kernel: ffff8100bf9e1c80 ffff8100b4d75e60 ffff810060720600 ffffffff802af1ba
> Dec 28 17:52:11 host kernel: ffff8100b6e5ced0 ffff8100b6e5ced0 ffff8100bfa11c00 ffff810060720600
> Dec 28 17:52:11 host kernel: Call Trace:
> Dec 28 17:52:11 host kernel: [<ffffffff80285555>] dput+0x26/0x103
> Dec 28 17:52:11 host kernel: [<ffffffff802aefa5>] sysfs_get_dentry+0x45/0x8f
> Dec 28 17:52:11 host kernel: [<ffffffff802af1ba>] sysfs_move_dir+0x63/0x204
> Dec 28 17:52:11 host kernel: [<ffffffff803006e5>] kobject_move+0xba/0x110
> Dec 28 17:52:11 host kernel: [<ffffffff80368640>] device_move+0x59/0x111
> Dec 28 17:52:11 host kernel: [<ffffffff8828841c>] :rfcomm:rfcomm_tty_close+0x2f/0x74
> Dec 28 17:52:11 host kernel: [<ffffffff803442ff>] release_dev+0x212/0x5e2
> Dec 28 17:52:11 host kernel: [<ffffffff80288801>] mntput_no_expire+0x1f/0x6f
> Dec 28 17:52:11 host kernel: [<ffffffff803e845d>] sys_sendto+0x128/0x151
> Dec 28 17:52:11 host kernel: [<ffffffff803446db>] tty_release+0xc/0x10
> Dec 28 17:52:11 host kernel: [<ffffffff80276f67>] __fput+0xb1/0x16f
> Dec 28 17:52:11 host kernel: [<ffffffff802748b5>] filp_close+0x5d/0x65
> Dec 28 17:52:11 host kernel: [<ffffffff80275a07>] sys_close+0x73/0xa6
> Dec 28 17:52:11 host kernel: [<ffffffff8020b5fc>] tracesys+0xdc/0xe1
> Dec 28 17:52:11 host kernel:
> Dec 28 17:52:11 host kernel:
> Dec 28 17:52:11 host kernel: Code: ff 0f 79 05 e8 c9 00 00 00 58 5a 5b c3 41 54 48 8d 47 08 48
> Dec 28 17:52:11 host kernel: RIP [<ffffffff8046e0d5>] mutex_lock+0x10/0x1d
> Dec 28 17:52:11 host kernel: RSP <ffff81009402fd08>
> Dec 28 17:52:11 host kernel: CR2: 00000000000000b8
> Dec 28 17:52:11 host kernel: ---[ end trace 12717319afdb56f5 ]---
>
> The bug is also present in 2.6.23.9:
>
> Dec 28 15:30:29 twister kernel: Unable to handle kernel paging request at fffffffffffffffe RIP:
> Dec 28 15:30:29 twister kernel: [<ffffffff8027ac13>] dput+0xd/0x103
> Dec 28 15:30:29 twister kernel: PGD 203067 PUD 204067 PMD 0
> Dec 28 15:30:29 twister kernel: Oops: 0000 [1]
> Dec 28 15:30:29 twister kernel: CPU 0
> Dec 28 15:30:29 twister kernel: Modules linked in: ppp_deflate zlib_deflate zlib_inflate bsd_comp ppp_async crc_ccitt ppp_generic slhc binfmt_misc rfcomm l2cap nfsd exportfs auth_rpcgss ipt_REJECT xt_tcpudp ipt_LOG xt_limit iptable_filter ip_tables x_tables nfs lockd nfs_acl sunrpc fuse dm_crypt dm_snapshot dm_mirror saa7134_alsa radeon it87 hwmon_vid eeprom hci_usb bluetooth tuner usb_storage saa7134 video_buf ir_kbd_i2c snd_intel8x0 sg firewire_ohci firewire_core ir_common crc_itu_t snd_ac97_codec ac97_bus forcedeth parport_pc parport sky2 ohci_hcd snd_pcm snd_timer snd_page_alloc ehci_hcd sr_mod cdrom floppy
> Dec 28 15:30:29 twister kernel: Pid: 14700, comm: pppd Not tainted 2.6.23.9 #1
> Dec 28 15:30:29 twister kernel: RIP: 0010:[<ffffffff8027ac13>] [<ffffffff8027ac13>] dput+0xd/0x103
> Dec 28 15:30:29 twister kernel: RSP: 0018:ffff810054e5bd28 EFLAGS: 00010282
> Dec 28 15:30:29 twister kernel: RAX: fffffffffffffffe RBX: fffffffffffffffe RCX: 0000000000000052
> Dec 28 15:30:29 twister kernel: RDX: ffffffff802f6414 RSI: ffff81000411b4e0 RDI: fffffffffffffffe
> Dec 28 15:30:29 twister kernel: RBP: ffff8100bfef14d0 R08: ffffffffffffffff R09: 0000000000005401
> Dec 28 15:30:29 twister kernel: R10: 00007fff7018a2e0 R11: 0000000000000246 R12: ffff8100b3731948
> Dec 28 15:30:29 twister kernel: R13: 0000000000000000 R14: ffff81005a1cb100 R15: fffffffffffffffe
> Dec 28 15:30:29 twister kernel: FS: 00002acd3a94cac0(0000) GS:ffffffff80592000(0000) knlGS:00000000f7d786b0
> Dec 28 15:30:29 twister kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Dec 28 15:30:29 twister kernel: CR2: fffffffffffffffe CR3: 0000000047f21000 CR4: 00000000000006e0
> Dec 28 15:30:29 twister kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Dec 28 15:30:29 twister kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Dec 28 15:30:29 twister kernel: Process pppd (pid: 14700, threadinfo ffff810054e5a000, task ffff81007907ce40)
> Dec 28 15:30:29 twister kernel: Stack: 00000000fffffffe ffffffff802a5a81 0000000000000053 ffff81000411b4e0
> Dec 28 15:30:29 twister kernel: 0000000000000000 ffff8100b539f8d0 ffff81000411b4e0 ffff8100b04e7060
> Dec 28 15:30:29 twister kernel: 00000000fffffff4 ffff81005a1cb100 ffff8100b09db800 ffffffff802f6448
> Dec 28 15:30:29 twister kernel: Call Trace:
> Dec 28 15:30:29 twister kernel: [<ffffffff802a5a81>] sysfs_move_dir+0x1d2/0x1ed
> Dec 28 15:30:29 twister kernel: [<ffffffff802f6448>] kobject_move+0xba/0x110
> Dec 28 15:30:29 twister kernel: [<ffffffff8035e0f0>] device_move+0x59/0x111
> Dec 28 15:30:29 twister kernel: [<ffffffff8827c382>] :rfcomm:rfcomm_tty_close+0x2f/0x74
> Dec 28 15:30:29 twister kernel: [<ffffffff803397ff>] release_dev+0x212/0x5e2
> Dec 28 15:30:29 twister kernel: [<ffffffff802d4206>] inode_has_perm+0x65/0x72
> Dec 28 15:30:29 twister kernel: [<ffffffff802d42a7>] file_has_perm+0x94/0xa3
> Dec 28 15:30:29 twister kernel: [<ffffffff80339bdb>] tty_release+0xc/0x11
> Dec 28 15:30:29 twister kernel: [<ffffffff8026c578>] __fput+0xb1/0x177
> Dec 28 15:30:29 twister kernel: [<ffffffff80269ecc>] filp_close+0x5d/0x65
> Dec 28 15:30:29 twister kernel: [<ffffffff8026b01a>] sys_close+0x72/0xa5
> Dec 28 15:30:29 twister kernel: [<ffffffff8020b5cc>] tracesys+0xdc/0xe1
> Dec 28 15:30:29 twister kernel:
> Dec 28 15:30:29 twister kernel:
> Dec 28 15:30:29 twister kernel: Code: 83 3b 01 75 05 e8 e8 29 1e 00 48 c7 c6 08 de 61 80 48 89 df
> Dec 28 15:30:29 twister kernel: RIP [<ffffffff8027ac13>] dput+0xd/0x103
> Dec 28 15:30:29 twister kernel: RSP <ffff810054e5bd28>
> Dec 28 15:30:29 twister kernel: CR2: fffffffffffffffe
>

Please try the -mm tree kernel, might have been fixed by :
http://lkml.org/lkml/2007/11/18/141

Regards
dave

2008-01-02 14:48:30

by Gabor Gombas

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Sat, Dec 29, 2007 at 04:07:04PM +0800, Dave Young wrote:

> Please try the -mm tree kernel, might have been fixed by :
> http://lkml.org/lkml/2007/11/18/141

I've applied the patch on top of 2.6.24-rc6 (I don't want to run -mm
kernels on this machine). We'll see what happens.

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------

2008-01-02 15:16:51

by Gabor Gombas

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Sat, Dec 29, 2007 at 04:07:04PM +0800, Dave Young wrote:

> Please try the -mm tree kernel, might have been fixed by :
> http://lkml.org/lkml/2007/11/18/141

Heh, it seems talking about a bug makes it trigger:

Jan 2 16:05:45 twister kernel: Unable to handle kernel NULL pointer dereference at 00000000000000b8 RIP:
Jan 2 16:05:45 twister kernel: [<ffffffff804720a5>] mutex_lock+0x10/0x1d
Jan 2 16:05:45 twister kernel: PGD bcf6e067 PUD bcee3067 PMD 0
Jan 2 16:05:45 twister kernel: Oops: 0002 [1]
Jan 2 16:05:45 twister kernel: CPU 0
Jan 2 16:05:45 twister kernel: Modules linked in: binfmt_misc rfcomm l2cap nfsd auth_rpcgss exportfs ipt_REJECT xt_tcpudp ipt_LOG xt_limit iptable_filter ip_tables x_tables nfs lockd nfs_acl sunrpc fuse dm_crypt dm_snapshot dm_mirror cpufreq_ondemand saa7134_alsa radeon hwmon_vid eeprom hci_usb bluetooth usb_storage tuner tea5767 tda8290 tuner_simple mt20xx tea5761 sg snd_intel8x0 saa7134 snd_ac97_codec ac97_bus videobuf_dma_sg videobuf_core ir_kbd_i2c sr_mod firewire_ohci firewire_core snd_pcm crc_itu_t ir_common ehci_hcd ohci_hcd cdrom snd_timer snd_page_alloc parport_pc parport sky2 forcedeth floppy
Jan 2 16:05:45 twister kernel: Pid: 5056, comm: cat Not tainted 2.6.24-rc6-dirty #3
Jan 2 16:05:45 twister kernel: RIP: 0010:[<ffffffff804720a5>] [<ffffffff804720a5>] mutex_lock+0x10/0x1d
Jan 2 16:05:45 twister kernel: RSP: 0018:ffff8100bce3fd08 EFLAGS: 00010246
Jan 2 16:05:45 twister kernel: RAX: 0000000000000000 RBX: 00000000000000b8 RCX: 0000000000000000
Jan 2 16:05:45 twister kernel: RDX: ffff8100bce3ffd8 RSI: ffffffff80641d70 RDI: 00000000000000b8
Jan 2 16:05:45 twister kernel: RBP: ffffffff80591db0 R08: 0000000000000000 R09: 00000000000899a2
Jan 2 16:05:45 twister kernel: R10: 0000000000000000 R11: 0000003000000018 R12: ffff8100bcb8ef50
Jan 2 16:05:45 twister kernel: R13: 00000000fffffff4 R14: ffff8100bcfc8e00 R15: ffff8100a370b300
Jan 2 16:05:45 twister kernel: FS: 00002b225d0e56e0(0000) GS:ffffffff805b8000(0000) knlGS:0000000000000000
Jan 2 16:05:45 twister kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 2 16:05:45 twister kernel: CR2: 00000000000000b8 CR3: 0000000095ad1000 CR4: 00000000000006e0
Jan 2 16:05:45 twister kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 2 16:05:45 twister kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 2 16:05:45 twister kernel: Process cat (pid: 5056, threadinfo ffff8100bce3e000, task ffff8100ba931060)
Jan 2 16:05:45 twister kernel: Stack: ffffe2000007e2d8 ffffffff80285555 ffff8100a3387000 ffffffff802aefa5
Jan 2 16:05:45 twister kernel: ffff8100bfa8af50 ffff8100bcb8ef50 ffff8100baab9300 ffffffff802af1ba
Jan 2 16:05:45 twister kernel: ffff8100a342a8d0 ffff8100a342a8d0 ffff8100bfa92dc0 ffff8100baab9300
Jan 2 16:05:45 twister kernel: Call Trace:
Jan 2 16:05:45 twister kernel: [<ffffffff80285555>] dput+0x26/0x103
Jan 2 16:05:45 twister kernel: [<ffffffff802aefa5>] sysfs_get_dentry+0x45/0x8f
Jan 2 16:05:45 twister kernel: [<ffffffff802af1ba>] sysfs_move_dir+0x63/0x204
Jan 2 16:05:45 twister kernel: [<ffffffff803006e5>] kobject_move+0xba/0x110
Jan 2 16:05:45 twister kernel: [<ffffffff80368a00>] device_move+0x59/0x111
Jan 2 16:05:45 twister kernel: [<ffffffff88292425>] :rfcomm:rfcomm_tty_close+0x2f/0x74
Jan 2 16:05:45 twister kernel: [<ffffffff803446bf>] release_dev+0x212/0x5e2
Jan 2 16:05:45 twister kernel: [<ffffffff8021b609>] do_page_fault+0x2ff/0x65a
Jan 2 16:05:45 twister kernel: [<ffffffff80344a9b>] tty_release+0xc/0x10
Jan 2 16:05:45 twister kernel: [<ffffffff80276f67>] __fput+0xb1/0x16f
Jan 2 16:05:45 twister kernel: [<ffffffff802748b5>] filp_close+0x5d/0x65
Jan 2 16:05:45 twister kernel: [<ffffffff80275a07>] sys_close+0x73/0xa6
Jan 2 16:05:45 twister kernel: [<ffffffff8020b5fc>] tracesys+0xdc/0xe1
Jan 2 16:05:45 twister kernel:
Jan 2 16:05:45 twister kernel:
Jan 2 16:05:45 twister kernel: Code: ff 0f 79 05 e8 c9 00 00 00 58 5a 5b c3 41 54 48 8d 47 08 48
Jan 2 16:05:45 twister kernel: RIP [<ffffffff804720a5>] mutex_lock+0x10/0x1d
Jan 2 16:05:45 twister kernel: RSP <ffff8100bce3fd08>
Jan 2 16:05:45 twister kernel: CR2: 00000000000000b8
Jan 2 16:05:45 twister kernel: ---[ end trace da76522f0284e9b6 ]---

So the patch referenced above does not help. But I've found a very easy
way to trigger the bug:

- do a "cat /dev/zero > /dev/rfcomm0"
- switch the phone off
- switch the phone on, and the kernel oopses

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------

2008-01-03 13:16:30

by Gabor Gombas

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Wed, Jan 02, 2008 at 04:16:42PM +0100, Gabor Gombas wrote:

> So the patch referenced above does not help. But I've found a very easy
> way to trigger the bug:
>
> - do a "cat /dev/zero > /dev/rfcomm0"
> - switch the phone off
> - switch the phone on, and the kernel oopses

FYI I also remember having oopses with 2.6.22.<something> but I do not
have those logs anymore. Also I now booted 2.6.20.7 and it did not oops.

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------

2008-01-04 01:02:19

by Dave Young

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Thu, Jan 03, 2008 at 02:16:21PM +0100, Gabor Gombas wrote:
> On Wed, Jan 02, 2008 at 04:16:42PM +0100, Gabor Gombas wrote:
>
> > So the patch referenced above does not help. But I've found a very easy
> > way to trigger the bug:
> >
> > - do a "cat /dev/zero > /dev/rfcomm0"
> > - switch the phone off
> > - switch the phone on, and the kernel oopses
>
> FYI I also remember having oopses with 2.6.22.<something> but I do not
> have those logs anymore. Also I now booted 2.6.20.7 and it did not oops.

Could you try the patch (on 2.6.24-rc6) following and check the debug messages?

diff -upr linux/net/bluetooth/rfcomm/tty.c linux.new/net/bluetooth/rfcomm/tty.c
--- linux/net/bluetooth/rfcomm/tty.c 2008-01-04 08:58:48.000000000 +0800
+++ linux.new/net/bluetooth/rfcomm/tty.c 2008-01-04 09:01:01.000000000 +0800
@@ -313,6 +313,7 @@ static void rfcomm_dev_del(struct rfcomm
{
BT_DBG("dev %p", dev);

+ dump_stack();
set_bit(RFCOMM_TTY_RELEASED, &dev->flags);
rfcomm_dev_put(dev);
}

2008-01-05 07:50:57

by Al Viro

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Wed, Jan 02, 2008 at 04:16:42PM +0100, Gabor Gombas wrote:
> Heh, it seems talking about a bug makes it trigger:
>
> Jan 2 16:05:45 twister kernel: Unable to handle kernel NULL pointer dereference at 00000000000000b8 RIP:
> Jan 2 16:05:45 twister kernel: [<ffffffff804720a5>] mutex_lock+0x10/0x1d

> So the patch referenced above does not help. But I've found a very easy
> way to trigger the bug:
>
> - do a "cat /dev/zero > /dev/rfcomm0"
> - switch the phone off
> - switch the phone on, and the kernel oopses

sysfs_get_dentry(),
mutex_lock(&parent->d_inode->i_mutex);
hitting parent->d_inode either NULL or very close to it, depending on your
.config; most likely NULL, if offset of i_mutex is 0xb8 in your build.
That's plausible - 0xb8 is what you'd get on UP build without spinlock
debugging, lockdep, etc.

Assuming that this is what we get, everything looks explainable - we
have sysfs_rename_dir() calling sysfs_get_dentry() while the parent
gets evicted. We don't have any exclusion, so while we are playing
silly buggers with lookups in sysfs_get_dentry() we have parent become
negative; the rest is obvious...

AFAICS, the locking here is quite broken and frankly, sysfs_get_dentry()
and the way it plays with fs/namei.c are ucking fugly.

Could you stick
if (!parent->d_inode)
printk(KERN_WARNING "sysfs locking blows: %s",
parent->d_name.name);
right before
mutex_lock(&parent->d_inode->i_mutex);
dentry = lookup_one_noperm(cur->s_name, parent);
mutex_unlock(&parent->d_inode->i_mutex);
in sysfs_get_dentry() (fs/sysfs/dir.c) and verify that it does, indeed,
trigger?

2008-01-05 14:30:42

by Tejun Heo

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

Hello.

Al Viro wrote:
> sysfs_get_dentry(),
> mutex_lock(&parent->d_inode->i_mutex);
> hitting parent->d_inode either NULL or very close to it, depending on your
> .config; most likely NULL, if offset of i_mutex is 0xb8 in your build.
> That's plausible - 0xb8 is what you'd get on UP build without spinlock
> debugging, lockdep, etc.
>
> Assuming that this is what we get, everything looks explainable - we
> have sysfs_rename_dir() calling sysfs_get_dentry() while the parent
> gets evicted. We don't have any exclusion, so while we are playing
> silly buggers with lookups in sysfs_get_dentry() we have parent become
> negative; the rest is obvious...

That part of code is walking down the sysfs tree from the s_root of
sysfs hierarchy and on each step parent is held using dget() while being
referenced, so I don't think they can turn negative there.

> AFAICS, the locking here is quite broken and frankly, sysfs_get_dentry()
> and the way it plays with fs/namei.c are ucking fugly.

Can you elaborate a bit? The locking in sysfs is unconventional but
that's mostly from necessity. It has dual interface - vfs and driver
model && vfs data structures (dentry and inode) are too big to always
keep around, so it basically becomes a small distributed file system
where the backing data can change asynchronously.

> Could you stick
> if (!parent->d_inode)
> printk(KERN_WARNING "sysfs locking blows: %s",
> parent->d_name.name);
> right before
> mutex_lock(&parent->d_inode->i_mutex);
> dentry = lookup_one_noperm(cur->s_name, parent);
> mutex_unlock(&parent->d_inode->i_mutex);
> in sysfs_get_dentry() (fs/sysfs/dir.c) and verify that it does, indeed,
> trigger?

Yes, please.

Thanks.

--
tejun

2008-01-05 19:45:25

by Al Viro

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Sat, Jan 05, 2008 at 11:30:25PM +0900, Tejun Heo wrote:
> > Assuming that this is what we get, everything looks explainable - we
> > have sysfs_rename_dir() calling sysfs_get_dentry() while the parent
> > gets evicted. We don't have any exclusion, so while we are playing
> > silly buggers with lookups in sysfs_get_dentry() we have parent become
> > negative; the rest is obvious...
>
> That part of code is walking down the sysfs tree from the s_root of
> sysfs hierarchy and on each step parent is held using dget() while being
> referenced, so I don't think they can turn negative there.

Turn? Just what stops you from getting a negative (and unhashed) from
lookup_one_noperm() and on the next iteration being buggered on mutex_lock()?

> > AFAICS, the locking here is quite broken and frankly, sysfs_get_dentry()
> > and the way it plays with fs/namei.c are ucking fugly.
>
> Can you elaborate a bit? The locking in sysfs is unconventional but
> that's mostly from necessity. It has dual interface - vfs and driver
> model && vfs data structures (dentry and inode) are too big to always
> keep around, so it basically becomes a small distributed file system
> where the backing data can change asynchronously.

... with all fun that creates. As it is, you have those async changers
of backing data using VFS locking _under_ sysfs locks via lookup_one_noperm()
and yet it needs sysfs_mutex inside sysfs_lookup(). So you can't have
sysfs_get_dentry() under it. So you don't have exclusion with arseloads
of sysfs tree changes in there. Joy...

Frankly, with the current state of sysfs the last vestiges of arguments
used to push it into the tree back then are dead and buried. I'm not
blaming you, BTW - the shitpile *did* grow past the point where its
memory footprint became far too large and something needed to be done.
Unfortunately, it happened too late for that something being "get rid
of the entire mess" and now we are saddled with it for good.

2008-01-06 02:08:17

by Tejun Heo

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

Hello,

Al Viro wrote:
> On Sat, Jan 05, 2008 at 11:30:25PM +0900, Tejun Heo wrote:
>>> Assuming that this is what we get, everything looks explainable - we
>>> have sysfs_rename_dir() calling sysfs_get_dentry() while the parent
>>> gets evicted. We don't have any exclusion, so while we are playing
>>> silly buggers with lookups in sysfs_get_dentry() we have parent become
>>> negative; the rest is obvious...
>> That part of code is walking down the sysfs tree from the s_root of
>> sysfs hierarchy and on each step parent is held using dget() while being
>> referenced, so I don't think they can turn negative there.
>
> Turn? Just what stops you from getting a negative (and unhashed) from
> lookup_one_noperm() and on the next iteration being buggered on mutex_lock()?

Right, I haven't thought about that. When sysfs_get_dentry() is called,
@sd is always valid so unless there was existing negative dentry, lookup
is guaranteed to return positive dentry, but by populating dcache with
negative dentry before a node is created, things can go wrong. I don't
think that's what's going on here tho. If that was the case, the
while() loop looking up the next sd to lookup (@cur) should have blown
up as negative dentry will have NULL d_fsdata which doesn't match any sd.

I guess what's needed here is d_revalidate() as other distributed
filesystems do. I'll test whether this can be actually triggered and
prepare a fix. Thanks a lot for pointing out the problem.

>>> AFAICS, the locking here is quite broken and frankly, sysfs_get_dentry()
>>> and the way it plays with fs/namei.c are ucking fugly.
>> Can you elaborate a bit? The locking in sysfs is unconventional but
>> that's mostly from necessity. It has dual interface - vfs and driver
>> model && vfs data structures (dentry and inode) are too big to always
>> keep around, so it basically becomes a small distributed file system
>> where the backing data can change asynchronously.
>
> ... with all fun that creates. As it is, you have those async changers
> of backing data using VFS locking _under_ sysfs locks via lookup_one_noperm()
> and yet it needs sysfs_mutex inside sysfs_lookup(). So you can't have
> sysfs_get_dentry() under it. So you don't have exclusion with arseloads
> of sysfs tree changes in there. Joy...

There are two locks. sysfs_rename_mutex and sysfs_mutex.
sysfs_rename_mutex is above VFS locks while sysfs_mutex is below VFS
locks. sysfs_rename_mutex() protects against move/rename which can
change the ancestry of a held sysfs_dirent while sysfs_mutex protects
the sd hierarchy itself. Locking can be wrong if sysfs_rename_mutex
locking is missing from the places where ancestry of a held sd can
change but I can't find one ATM. If I'm missing your point again, feel
free to scream at me. :-)

As it's unnecessarily unintuitive, there's a pending change to rename
sysfs_rename_mutex and use it to protect the whole tree structure to
make locking simpler while using sysfs_mutex to guard VFS access such
that the locking hierarchy plainly becomes sysfs_rename_mutex - VFS
locks - sysfs_mutex where all internal sysfs structure is protected by
the outer mutex and the inner one just protects VFS accesses.

> Frankly, with the current state of sysfs the last vestiges of arguments
> used to push it into the tree back then are dead and buried. I'm not
> blaming you, BTW - the shitpile *did* grow past the point where its
> memory footprint became far too large and something needed to be done.
> Unfortunately, it happened too late for that something being "get rid
> of the entire mess" and now we are saddled with it for good.

Yeah, it's too late to get rid of sysfs and regardless implementation
ugliness, which BTW I think has improved a lot during last six or so
months, it's now pretty useful and important to drivers, so I guess the
only option is trying hard to make it better.

Oh, BTW, the ugly lookup_one_noperm() can be removed if LOOKUP_NOPERM
flag is added. The only reason sysfs_lookup() uses the specialized
lookup is to avoid permission check.

Thanks.

--
tejun

2008-01-06 02:18:39

by Al Viro

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Sun, Jan 06, 2008 at 11:07:52AM +0900, Tejun Heo wrote:

> Right, I haven't thought about that. When sysfs_get_dentry() is called,
> @sd is always valid so unless there was existing negative dentry, lookup
> is guaranteed to return positive dentry, but by populating dcache with
> negative dentry before a node is created, things can go wrong. I don't
> think that's what's going on here tho. If that was the case, the
> while() loop looking up the next sd to lookup (@cur) should have blown
> up as negative dentry will have NULL d_fsdata which doesn't match any sd.

No. What happens if sd gets unlinked while we are on the way to
sysfs_get_dentry() and so does its parent? The parent is off the
sibling list, we get negative dentry from lookup. It's not hashed,
so won't stick around in dcache (which is apparently what you are
thinking about). However, _THIS_ lookup has returned you a dentry
with NULL ->d_inode and you are well and truly buggered.

2008-01-06 02:55:00

by Tejun Heo

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

Hello, Al Viro.

Al Viro wrote:
> On Sun, Jan 06, 2008 at 11:07:52AM +0900, Tejun Heo wrote:
>
>> Right, I haven't thought about that. When sysfs_get_dentry() is called,
>> @sd is always valid so unless there was existing negative dentry, lookup
>> is guaranteed to return positive dentry, but by populating dcache with
>> negative dentry before a node is created, things can go wrong. I don't
>> think that's what's going on here tho. If that was the case, the
>> while() loop looking up the next sd to lookup (@cur) should have blown
>> up as negative dentry will have NULL d_fsdata which doesn't match any sd.
>
> No. What happens if sd gets unlinked while we are on the way to
> sysfs_get_dentry() and so does its parent?

That means sysfs_remove_dir() is called on parent while other operations
are in progress on children, right? sysfs has never allowed such things
&& AFAIK no one does that. It's somewhat implied in the interface (such
as recursive removing) but I fully agree it's problematic. Things like
these are why I think we need to unify/simplify locking as I wrote
previously.

> The parent is off the
> sibling list, we get negative dentry from lookup. It's not hashed,
> so won't stick around in dcache (which is apparently what you are
> thinking about). However, _THIS_ lookup has returned you a dentry
> with NULL ->d_inode and you are well and truly buggered.

So, the assumption is that while @sd is held and being operated on its
ancestry can never change except for rename and move and those are
protected with sysfs_rename_mutex.

I agree that the locking is awkward but sysfs's internal implementation
has completely changed over last six or so months and the conversion
still isn't complete. It isn't pretty at all but is at least
understandable now, IMHO. :-)

The last pending change is separating out kobject from sysfs and
simplifying locking. I had some disagreement with Greg about the future
of kobject and then there was tsunami of ATA bugs because most major
distros converted to libata from IDE in the second half of the last year
and I'm still buried under it. I'll get back to sysfs once these ATA
bugs subside a bit.

Thanks.

--
tejun

2008-01-06 03:35:50

by Al Viro

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Sun, Jan 06, 2008 at 11:54:43AM +0900, Tejun Heo wrote:
> That means sysfs_remove_dir() is called on parent while other operations
> are in progress on children, right? sysfs has never allowed such things
> && AFAIK no one does that. It's somewhat implied in the interface (such
> as recursive removing) but I fully agree it's problematic. Things like
> these are why I think we need to unify/simplify locking as I wrote
> previously.

All it takes is kobject_rename() or kobject_move() called asynchronously
wrt removal... I don't see an explicit ban for that.

FWIW, what happens here *is* fishy, but I don't see an outright ban on
that in documentation - rfcomm_tty_open() does
device_move(dev->tty_dev, rfcomm_get_device(dev));
when we get openers, rfcomm_tty_close() does
device_move(dev->tty_dev, NULL);
when the number of openers hits zero. Can happen repeatedly.

Note that device_move() with new parent being NULL is explicitly allowed
and handled, so...

2008-01-06 03:55:04

by Tejun Heo

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

Hello,

Al Viro wrote:
> On Sun, Jan 06, 2008 at 11:54:43AM +0900, Tejun Heo wrote:
>> That means sysfs_remove_dir() is called on parent while other operations
>> are in progress on children, right? sysfs has never allowed such things
>> && AFAIK no one does that. It's somewhat implied in the interface (such
>> as recursive removing) but I fully agree it's problematic. Things like
>> these are why I think we need to unify/simplify locking as I wrote
>> previously.
>
> All it takes is kobject_rename() or kobject_move() called asynchronously
> wrt removal... I don't see an explicit ban for that.

There really hasn't been any stone-set rules for how sysfs can be used
and what operations are allowed simultaneously although sysfs
implementation always had a lot of assumptions about which ops can and
can't be done simultaneously.

The general assumption is that the caller is responsible for its and its
children's lifetime management which is usually followed as sysfs nodes
are removed when a device goes away or driver detaches at which point no
other ops should be in progress anyway.

I think this stems partially from tight coupling with kobject / driver
model. kobject and driver model have certain rules about how they can
be used (again not explicit enough) and sysfs implementation kind of
piggy backed on those rules and added its own assumptions on top of it.
After a while, it's hard to track what's assumed where and implementing
or changing one feature somewhere breaks some assumption elsewhere.
This is the primary reason why I think sysfs should be an independent
component which the driver model uses not something intertwined into
driver model while having mostly separate implementation.

> FWIW, what happens here *is* fishy, but I don't see an outright ban on
> that in documentation - rfcomm_tty_open() does
> device_move(dev->tty_dev, rfcomm_get_device(dev));
> when we get openers, rfcomm_tty_close() does
> device_move(dev->tty_dev, NULL);
> when the number of openers hits zero. Can happen repeatedly.
>
> Note that device_move() with new parent being NULL is explicitly allowed
> and handled, so...

(cc'ing Kay Sievers) IIRC, that's device class compatibility thing.
Overlapping move's are okay tho as they're protected from each other by
sysfs_rename_mutex. I'll take a look at the rfcomm code and see what's
going on tomorrow. Gotta hit the road now.

Thanks.

--
tejun

2008-01-07 02:38:05

by Tejun Heo

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

Hello,

Tejun Heo wrote:
> Al Viro wrote:
>> On Sat, Jan 05, 2008 at 11:30:25PM +0900, Tejun Heo wrote:
>>>> Assuming that this is what we get, everything looks explainable - we
>>>> have sysfs_rename_dir() calling sysfs_get_dentry() while the parent
>>>> gets evicted. We don't have any exclusion, so while we are playing
>>>> silly buggers with lookups in sysfs_get_dentry() we have parent become
>>>> negative; the rest is obvious...
>>> That part of code is walking down the sysfs tree from the s_root of
>>> sysfs hierarchy and on each step parent is held using dget() while being
>>> referenced, so I don't think they can turn negative there.
>> Turn? Just what stops you from getting a negative (and unhashed) from
>> lookup_one_noperm() and on the next iteration being buggered on mutex_lock()?
>
> Right, I haven't thought about that. When sysfs_get_dentry() is called,
> @sd is always valid so unless there was existing negative dentry, lookup
> is guaranteed to return positive dentry, but by populating dcache with
> negative dentry before a node is created, things can go wrong. I don't
> think that's what's going on here tho. If that was the case, the
> while() loop looking up the next sd to lookup (@cur) should have blown
> up as negative dentry will have NULL d_fsdata which doesn't match any sd.
>
> I guess what's needed here is d_revalidate() as other distributed
> filesystems do. I'll test whether this can be actually triggered and
> prepare a fix. Thanks a lot for pointing out the problem.

This can't happen because lookup of non-existent entry doesn't create a
negative dentry. The new dentry is never hashed and killed after lookup
failure, the above scenario can't happen.

That said, the mechanism is a bit too fragile. sysfs currently ensures
that dentry/inode point to the associated sysfs_dirent. This is mainly
remanent of conversion from previous VFS based implementation. I think
the right thing to do here is to make sysfs behave like other proper
distributed filesystems using d_revalidate.

Thanks.

--
tejun

2008-01-07 08:07:55

by Tejun Heo

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

Dave Young wrote:
> On Thu, Jan 03, 2008 at 02:16:21PM +0100, Gabor Gombas wrote:
>> On Wed, Jan 02, 2008 at 04:16:42PM +0100, Gabor Gombas wrote:
>>
>>> So the patch referenced above does not help. But I've found a very easy
>>> way to trigger the bug:
>>>
>>> - do a "cat /dev/zero > /dev/rfcomm0"
>>> - switch the phone off
>>> - switch the phone on, and the kernel oopses
>> FYI I also remember having oopses with 2.6.22.<something> but I do not
>> have those logs anymore. Also I now booted 2.6.20.7 and it did not oops.
>
> Could you try the patch (on 2.6.24-rc6) following and check the debug messages?
>
> diff -upr linux/net/bluetooth/rfcomm/tty.c linux.new/net/bluetooth/rfcomm/tty.c
> --- linux/net/bluetooth/rfcomm/tty.c 2008-01-04 08:58:48.000000000 +0800
> +++ linux.new/net/bluetooth/rfcomm/tty.c 2008-01-04 09:01:01.000000000 +0800
> @@ -313,6 +313,7 @@ static void rfcomm_dev_del(struct rfcomm
> {
> BT_DBG("dev %p", dev);
>
> + dump_stack();
> set_bit(RFCOMM_TTY_RELEASED, &dev->flags);
> rfcomm_dev_put(dev);
> }

Tried to reproduce the problem here but I can't persuade my laptop to
talk to my cell. :-( Gabor, can you please report what Dave Young asked?

Thanks.

--
tejun

2008-01-07 08:22:46

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

Tejun Heo <[email protected]> writes:

> Hello,
>
> Tejun Heo wrote:
>> Al Viro wrote:
>>> On Sat, Jan 05, 2008 at 11:30:25PM +0900, Tejun Heo wrote:
>>>>> Assuming that this is what we get, everything looks explainable - we
>>>>> have sysfs_rename_dir() calling sysfs_get_dentry() while the parent
>>>>> gets evicted. We don't have any exclusion, so while we are playing
>>>>> silly buggers with lookups in sysfs_get_dentry() we have parent become
>>>>> negative; the rest is obvious...
>>>> That part of code is walking down the sysfs tree from the s_root of
>>>> sysfs hierarchy and on each step parent is held using dget() while being
>>>> referenced, so I don't think they can turn negative there.
>>> Turn? Just what stops you from getting a negative (and unhashed) from
>>> lookup_one_noperm() and on the next iteration being buggered on mutex_lock()?
>>
>> Right, I haven't thought about that. When sysfs_get_dentry() is called,
>> @sd is always valid so unless there was existing negative dentry, lookup
>> is guaranteed to return positive dentry, but by populating dcache with
>> negative dentry before a node is created, things can go wrong. I don't
>> think that's what's going on here tho. If that was the case, the
>> while() loop looking up the next sd to lookup (@cur) should have blown
>> up as negative dentry will have NULL d_fsdata which doesn't match any sd.
>>
>> I guess what's needed here is d_revalidate() as other distributed
>> filesystems do. I'll test whether this can be actually triggered and
>> prepare a fix. Thanks a lot for pointing out the problem.
>
> This can't happen because lookup of non-existent entry doesn't create a
> negative dentry. The new dentry is never hashed and killed after lookup
> failure, the above scenario can't happen.
>
> That said, the mechanism is a bit too fragile. sysfs currently ensures
> that dentry/inode point to the associated sysfs_dirent. This is mainly
> remanent of conversion from previous VFS based implementation. I think
> the right thing to do here is to make sysfs behave like other proper
> distributed filesystems using d_revalidate.

Huh? We still need something like sysfs_get_dentry to find the dentries
for the rename or move operation. So we can call d_move.

Further we should be talking about sysfs_move_dir not sysfs_rename_dir
as only the networking code calls sysfs_rename_dir. Not that it is
significantly different.

I think I saw a change in lock ordering recently to help distributed
filesystems, and maybe that will simplify some of this.

Anyway I figure if I am to understand this I better go look and see if
I can find the start of this thread. I don't have a clue
where we are blowing up.

Eric

2008-01-07 09:17:37

by Tejun Heo

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

Eric W. Biederman wrote:
>> That said, the mechanism is a bit too fragile. sysfs currently ensures
>> that dentry/inode point to the associated sysfs_dirent. This is mainly
>> remanent of conversion from previous VFS based implementation. I think
>> the right thing to do here is to make sysfs behave like other proper
>> distributed filesystems using d_revalidate.
>
> Huh? We still need something like sysfs_get_dentry to find the dentries
> for the rename or move operation. So we can call d_move.

Ah... right. Thanks. :-)

--
tejun

2008-01-07 09:18:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

Tejun Heo wrote:
> Eric W. Biederman wrote:
>>> That said, the mechanism is a bit too fragile. sysfs currently ensures
>>> that dentry/inode point to the associated sysfs_dirent. This is mainly
>>> remanent of conversion from previous VFS based implementation. I think
>>> the right thing to do here is to make sysfs behave like other proper
>>> distributed filesystems using d_revalidate.
>> Huh? We still need something like sysfs_get_dentry to find the dentries
>> for the rename or move operation. So we can call d_move.
>
> Ah... right. Thanks. :-)

On the second thought, can't those too be dealt with d_revalidate?

--
tejun

2008-01-07 09:23:18

by Al Viro

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Mon, Jan 07, 2008 at 06:18:21PM +0900, Tejun Heo wrote:
> Tejun Heo wrote:
> > Eric W. Biederman wrote:
> >>> That said, the mechanism is a bit too fragile. sysfs currently ensures
> >>> that dentry/inode point to the associated sysfs_dirent. This is mainly
> >>> remanent of conversion from previous VFS based implementation. I think
> >>> the right thing to do here is to make sysfs behave like other proper
> >>> distributed filesystems using d_revalidate.
> >> Huh? We still need something like sysfs_get_dentry to find the dentries
> >> for the rename or move operation. So we can call d_move.
> >
> > Ah... right. Thanks. :-)
>
> On the second thought, can't those too be dealt with d_revalidate?

FVO "dealt with" as pleasant and efficient as using coarse whetstone
to deal with caries.

2008-01-07 10:34:41

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

Al Viro <[email protected]> writes:

> On Mon, Jan 07, 2008 at 06:18:21PM +0900, Tejun Heo wrote:
>> Tejun Heo wrote:
>> > Eric W. Biederman wrote:
>> >>> That said, the mechanism is a bit too fragile. sysfs currently ensures
>> >>> that dentry/inode point to the associated sysfs_dirent. This is mainly
>> >>> remanent of conversion from previous VFS based implementation. I think
>> >>> the right thing to do here is to make sysfs behave like other proper
>> >>> distributed filesystems using d_revalidate.
>> >> Huh? We still need something like sysfs_get_dentry to find the dentries
>> >> for the rename or move operation. So we can call d_move.
>> >
>> > Ah... right. Thanks. :-)
>>
>> On the second thought, can't those too be dealt with d_revalidate?
>
> FVO "dealt with" as pleasant and efficient as using coarse whetstone
> to deal with caries.

Or to say it another way. The linux VFS requires dentries to be
preserved and used as long as we can. dropping them early causes
some weird nasties to show up, in particular it totally messes up
mounting other filesystems on top.

Eric

2008-01-07 14:10:28

by Gabor Gombas

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

Hi,

On Fri, Jan 04, 2008 at 09:05:42AM +0800, Dave Young wrote:

> Could you try the patch (on 2.6.24-rc6) following and check the debug messages?
>
> diff -upr linux/net/bluetooth/rfcomm/tty.c linux.new/net/bluetooth/rfcomm/tty.c
> --- linux/net/bluetooth/rfcomm/tty.c 2008-01-04 08:58:48.000000000 +0800
> +++ linux.new/net/bluetooth/rfcomm/tty.c 2008-01-04 09:01:01.000000000 +0800
> @@ -313,6 +313,7 @@ static void rfcomm_dev_del(struct rfcomm
> {
> BT_DBG("dev %p", dev);
>
> + dump_stack();
> set_bit(RFCOMM_TTY_RELEASED, &dev->flags);
> rfcomm_dev_put(dev);
> }

This did not trigger.

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------

2008-01-07 14:13:16

by Gabor Gombas

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

Hi,

On Sat, Jan 05, 2008 at 07:50:39AM +0000, Al Viro wrote:

> Could you stick
> if (!parent->d_inode)
> printk(KERN_WARNING "sysfs locking blows: %s",
> parent->d_name.name);
> right before
> mutex_lock(&parent->d_inode->i_mutex);
> dentry = lookup_one_noperm(cur->s_name, parent);
> mutex_unlock(&parent->d_inode->i_mutex);
> in sysfs_get_dentry() (fs/sysfs/dir.c) and verify that it does, indeed,
> trigger?

Here it is:

Jan 7 14:35:43 twister kernel: sysfs locking blows: acl001BAFE1624D<1>Unable to handle kernel NULL pointer dereference at 00000000000000b8 RIP:
Jan 7 14:35:43 twister kernel: [<ffffffff80473835>] mutex_lock+0x10/0x1d
Jan 7 14:35:43 twister kernel: PGD b6031067 PUD b609c067 PMD 0
Jan 7 14:35:43 twister kernel: Oops: 0002 [1]
Jan 7 14:35:43 twister kernel: CPU 0
Jan 7 14:35:43 twister kernel: Modules linked in: binfmt_misc rfcomm l2cap nfsd auth_rpcgss exportfs ipt_REJECT xt_tcpudp ipt_LOG xt_limit iptable_filter ip_tables x_tables nfs lockd nfs_acl sunrpc fuse dm_crypt saa7134_alsa radeon hwmon_vid eeprom hci_usb bluetooth usb_storage tuner tea5767 tda8290 tuner_simple mt20xx tea5761 parport_pc parport floppy snd_intel8x0 snd_ac97_codec saa7134 ac97_bus firewire_ohci firewire_core videobuf_dma_sg videobuf_core ir_kbd_i2c sky2 ir_common crc_itu_t snd_pcm forcedeth snd_timer snd_page_alloc ehci_hcd ohci_hcd sg sr_mod cdrom
Jan 7 14:35:43 twister kernel: Pid: 3218, comm: cat Not tainted 2.6.24-rc6debug-dirty #5
Jan 7 14:35:43 twister kernel: RIP: 0010:[<ffffffff80473835>] [<ffffffff80473835>] mutex_lock+0x10/0x1d
Jan 7 14:35:43 twister kernel: RSP: 0018:ffff8100b5733d08 EFLAGS: 00010246
Jan 7 14:35:43 twister kernel: RAX: 0000000000000000 RBX: 00000000000000b8 RCX: 0000000000000001
Jan 7 14:35:43 twister kernel: RDX: ffff8100b5733fd8 RSI: ffff8100bfab6000 RDI: 00000000000000b8
Jan 7 14:35:43 twister kernel: RBP: ffffffff80593db0 R08: ffffffff8057b118 R09: ffffffff806093d0
Jan 7 14:35:43 twister kernel: R10: ffff8100b6074aa0 R11: ffff81000101b100 R12: ffff8100b6074aa0
Jan 7 14:35:43 twister kernel: R13: 00000000fffffff4 R14: ffff8100b571fb80 R15: ffff8100b573e000
Jan 7 14:35:43 twister kernel: FS: 00002aead9fc56e0(0000) GS:ffffffff805ba000(0000) knlGS:0000000000000000
Jan 7 14:35:43 twister kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 7 14:35:43 twister kernel: CR2: 00000000000000b8 CR3: 00000000b60b2000 CR4: 00000000000006e0
Jan 7 14:35:43 twister kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 7 14:35:43 twister kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 7 14:35:43 twister kernel: Process cat (pid: 3218, threadinfo ffff8100b5732000, task ffff8100bfa70830)
Jan 7 14:35:43 twister kernel: Stack: ffffe2000007e310 ffffffff8028560d ffff8100b202a7d0 ffffffff802af076
Jan 7 14:35:43 twister kernel: ffff8100bfa52f00 ffff8100b6074aa0 ffff8100b618c720 ffffffff802af28b
Jan 7 14:35:43 twister kernel: ffff8100b61308d0 ffff8100b61308d0 ffff8100bfa1a800 ffff8100b618c720
Jan 7 14:35:43 twister kernel: Call Trace:
Jan 7 14:35:43 twister kernel: [<ffffffff8028560d>] dput+0x26/0x103
Jan 7 14:35:43 twister kernel: [<ffffffff802af076>] sysfs_get_dentry+0x5e/0xa8
Jan 7 14:35:43 twister kernel: [<ffffffff802af28b>] sysfs_move_dir+0x63/0x204
Jan 7 14:35:43 twister kernel: [<ffffffff803007c5>] kobject_move+0xba/0x110
Jan 7 14:35:43 twister kernel: [<ffffffff803698ec>] device_move+0x59/0x111
Jan 7 14:35:43 twister kernel: [<ffffffff8827c433>] :rfcomm:rfcomm_tty_close+0x2f/0x74
Jan 7 14:35:43 twister kernel: [<ffffffff803447af>] release_dev+0x212/0x5e2
Jan 7 14:35:43 twister kernel: [<ffffffff8021b609>] do_page_fault+0x2ff/0x65a
Jan 7 14:35:43 twister kernel: [<ffffffff80344b8b>] tty_release+0xc/0x10
Jan 7 14:35:43 twister kernel: [<ffffffff80277013>] __fput+0xb1/0x16f
Jan 7 14:35:43 twister kernel: [<ffffffff80274961>] filp_close+0x5d/0x65
Jan 7 14:35:43 twister kernel: [<ffffffff80275ab3>] sys_close+0x73/0xa6
Jan 7 14:35:43 twister kernel: [<ffffffff8020b5fc>] tracesys+0xdc/0xe1
Jan 7 14:35:43 twister kernel:
Jan 7 14:35:43 twister kernel:
Jan 7 14:35:43 twister kernel: Code: ff 0f 79 05 e8 c9 00 00 00 58 5a 5b c3 41 54 48 8d 47 08 48
Jan 7 14:35:43 twister kernel: RIP [<ffffffff80473835>] mutex_lock+0x10/0x1d
Jan 7 14:35:43 twister kernel: RSP <ffff8100b5733d08>
Jan 7 14:35:43 twister kernel: CR2: 00000000000000b8
Jan 7 14:35:43 twister kernel: ---[ end trace 0e3a1176e9294792 ]---

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------

2008-01-07 15:24:22

by Tejun Heo

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

Gabor Gombas wrote:
> Hi,
>
> On Sat, Jan 05, 2008 at 07:50:39AM +0000, Al Viro wrote:
>
>> Could you stick
>> if (!parent->d_inode)
>> printk(KERN_WARNING "sysfs locking blows: %s",
>> parent->d_name.name);
>> right before
>> mutex_lock(&parent->d_inode->i_mutex);
>> dentry = lookup_one_noperm(cur->s_name, parent);
>> mutex_unlock(&parent->d_inode->i_mutex);
>> in sysfs_get_dentry() (fs/sysfs/dir.c) and verify that it does, indeed,
>> trigger?
>
> Here it is:
>
> Jan 7 14:35:43 twister kernel: sysfs locking blows: acl001BAFE1624D<1>Unable to handle kernel NULL pointer dereference at 00000000000000b8 RIP:

Does the attached patch fix the problem?

--
tejun


Attachments:
kill-extra-put-in-sysfs_move_dir.patch (341.00 B)

2008-01-07 21:00:36

by Gabor Gombas

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Tue, Jan 08, 2008 at 12:24:05AM +0900, Tejun Heo wrote:

> Does the attached patch fix the problem?

No, it still oopses.

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------

2008-01-08 09:42:59

by Tejun Heo

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

Gabor Gombas wrote:
> On Tue, Jan 08, 2008 at 12:24:05AM +0900, Tejun Heo wrote:
>
>> Does the attached patch fix the problem?
>
> No, it still oopses.

Thanks. Please apply the attached patch and report the oops.

--
tejun


Attachments:
debug.patch (2.38 kB)

2008-01-08 13:32:29

by Gabor Gombas

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Tue, Jan 08, 2008 at 06:42:43PM +0900, Tejun Heo wrote:

> Thanks. Please apply the attached patch and report the oops.

Jan 8 14:23:29 twister kernel: XXX: moving /devices/virtual/tty/rfcomm0 under /devices/pci0000:00/0000:00:02.0/usb1/1-3/1-3:1.0/hci0/acl001BAFE1624D/tty
Jan 8 14:23:29 twister kernel: XXX: looking up / (ffff8100bf400000) / devices
Jan 8 14:23:29 twister kernel: XXX: looking up devices (ffff8100bf658cc0) / virtual
Jan 8 14:23:29 twister kernel: XXX: looking up virtual (ffff8100bf417220) / tty
Jan 8 14:23:29 twister kernel: XXX: looking up tty (ffff8100bb31c440) / rfcomm0
Jan 8 14:23:29 twister kernel: XXX: looking up / (ffff8100bf400000) / devices
Jan 8 14:23:29 twister kernel: XXX: looking up devices (ffff8100bf658cc0) / pci0000:00
Jan 8 14:23:29 twister kernel: XXX: looking up pci0000:00 (ffff8100bf658880) / 0000:00:02.0
Jan 8 14:23:29 twister kernel: XXX: looking up 0000:00:02.0 (ffff8100bb2abcc0) / usb1
Jan 8 14:23:29 twister kernel: XXX: looking up usb1 (ffff8100bb306000) / 1-3
Jan 8 14:23:29 twister kernel: XXX: looking up 1-3 (ffff8100bb309220) / 1-3:1.0
Jan 8 14:23:29 twister kernel: XXX: looking up 1-3:1.0 (ffff8100bb309880) / hci0
Jan 8 14:23:29 twister kernel: XXX: looking up hci0 (ffff8100bb318aa0) / acl001BAFE1624D
Jan 8 14:23:29 twister kernel: XXX: looking up acl001BAFE1624D (ffff8100b555a000) / tty
Jan 8 14:23:52 twister kernel: XXX: moving /devices/pci0000:00/0000:00:02.0/usb1/1-3/1-3:1.0/hci0/acl001BAFE1624D/tty/rfcomm0 under /devices/virtual/tty
Jan 8 14:23:52 twister kernel: XXX: looking up / (ffff8100bf400000) / devices
Jan 8 14:23:52 twister kernel: XXX: looking up devices (ffff8100bf658cc0) / pci0000:00
Jan 8 14:23:52 twister kernel: XXX: looking up pci0000:00 (ffff8100bf658880) / 0000:00:02.0
Jan 8 14:23:52 twister kernel: XXX: looking up 0000:00:02.0 (ffff8100bb2abcc0) / usb1
Jan 8 14:23:52 twister kernel: XXX: looking up usb1 (ffff8100bb306000) / 1-3
Jan 8 14:23:52 twister kernel: XXX: looking up 1-3 (ffff8100bb309220) / 1-3:1.0
Jan 8 14:23:52 twister kernel: XXX: looking up 1-3:1.0 (ffff8100bb309880) / hci0
Jan 8 14:23:52 twister kernel: XXX: looking up hci0 (ffff8100bb318aa0) / acl001BAFE1624D
Jan 8 14:23:52 twister kernel: XXX: looking up acl001BAFE1624D (0000000000000000) /
Jan 8 14:23:52 twister kernel: Unable to handle kernel NULL pointer dereference at 00000000000000b8 RIP:
Jan 8 14:23:52 twister kernel: [<ffffffff80473875>] mutex_lock+0x10/0x1d
Jan 8 14:23:52 twister kernel: PGD b19b6067 PUD b1969067 PMD 0
Jan 8 14:23:52 twister kernel: Oops: 0002 [1]
Jan 8 14:23:52 twister kernel: CPU 0
Jan 8 14:23:52 twister kernel: Modules linked in: binfmt_misc rfcomm l2cap nfsd auth_rpcgss exportfs ipt_REJECT xt_tcpudp ipt_LOG xt_limit iptable_filter ip_tables x_tables nfs lockd nfs_acl sunrpc fuse dm_crypt saa7134_alsa radeon hwmon_vid eeprom hci_usb bluetooth usb_storage tuner tea5767 tda8290 tuner_simple mt20xx tea5761 snd_intel8x0 sg snd_ac97_codec ac97_bus saa7134 firewire_ohci firewire_core crc_itu_t videobuf_dma_sg videobuf_core ir_kbd_i2c snd_pcm ir_common sky2 forcedeth ehci_hcd snd_timer snd_page_alloc ohci_hcd sr_mod parport_pc parport cdrom floppy
Jan 8 14:23:52 twister kernel: Pid: 3228, comm: cat Not tainted 2.6.24-rc6debug-dirty #7
Jan 8 14:23:52 twister kernel: RIP: 0010:[<ffffffff80473875>] [<ffffffff80473875>] mutex_lock+0x10/0x1d
Jan 8 14:23:52 twister kernel: RSP: 0018:ffff8100b19a5d08 EFLAGS: 00010246
Jan 8 14:23:52 twister kernel: RAX: 0000000000000000 RBX: 00000000000000b8 RCX: 0000000000010352
Jan 8 14:23:52 twister kernel: RDX: ffff8100b19a5fd8 RSI: 0000000000010352 RDI: 00000000000000b8
Jan 8 14:23:52 twister kernel: RBP: ffffffff80593db0 R08: 0000000000000000 R09: 0000000000000097
Jan 8 14:23:52 twister kernel: R10: 000000000000000e R11: ffffffff803103d4 R12: ffff8100ba312b90
Jan 8 14:23:52 twister kernel: R13: ffff8100bfa78d00 R14: ffff8100b1980b00 R15: ffff8100b6c6b840
Jan 8 14:23:52 twister kernel: FS: 00002b3bd213d6e0(0000) GS:ffffffff805ba000(0000) knlGS:0000000000000000
Jan 8 14:23:52 twister kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 8 14:23:52 twister kernel: CR2: 00000000000000b8 CR3: 00000000b19ae000 CR4: 00000000000006e0
Jan 8 14:23:52 twister kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 8 14:23:52 twister kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 8 14:23:52 twister kernel: Process cat (pid: 3228, threadinfo ffff8100b19a4000, task ffff8100b18d9060)
Jan 8 14:23:52 twister kernel: Stack: ffff8100bfa78d00 ffffffff8028560d ffff8100b554ca28 ffffffff802ae9d5
Jan 8 14:23:52 twister kernel: ffff8100bf9f4fa0 ffff8100ba312b90 ffff8100b5326c60 ffffffff802af2de
Jan 8 14:23:52 twister kernel: ffff8100b6ca4cd0 ffff8100b6ca4cd0 ffff8100bfa78d00 ffff8100b5326ba0
Jan 8 14:23:52 twister kernel: Call Trace:
Jan 8 14:23:52 twister kernel: [<ffffffff8028560d>] dput+0x26/0x103
Jan 8 14:23:52 twister kernel: [<ffffffff802ae9d5>] __sysfs_get_dentry+0x5f/0xa9
Jan 8 14:23:52 twister kernel: [<ffffffff802af2de>] sysfs_move_dir+0xae/0x247
Jan 8 14:23:52 twister kernel: [<ffffffff80300805>] kobject_move+0xba/0x110
Jan 8 14:23:52 twister kernel: [<ffffffff8036992c>] device_move+0x59/0x111
Jan 8 14:23:52 twister kernel: [<ffffffff8827c433>] :rfcomm:rfcomm_tty_close+0x2f/0x74
Jan 8 14:23:52 twister kernel: [<ffffffff803447ef>] release_dev+0x212/0x5e2
Jan 8 14:23:52 twister kernel: [<ffffffff8021b609>] do_page_fault+0x2ff/0x65a
Jan 8 14:23:52 twister kernel: [<ffffffff80344bcb>] tty_release+0xc/0x10
Jan 8 14:23:52 twister kernel: [<ffffffff80277013>] __fput+0xb1/0x16f
Jan 8 14:23:52 twister kernel: [<ffffffff80274961>] filp_close+0x5d/0x65
Jan 8 14:23:52 twister kernel: [<ffffffff80275ab3>] sys_close+0x73/0xa6
Jan 8 14:23:52 twister kernel: [<ffffffff8020b5fc>] tracesys+0xdc/0xe1
Jan 8 14:23:52 twister kernel:
Jan 8 14:23:52 twister kernel:
Jan 8 14:23:52 twister kernel: Code: ff 0f 79 05 e8 c9 00 00 00 58 5a 5b c3 41 54 48 8d 47 08 48
Jan 8 14:23:52 twister kernel: RIP [<ffffffff80473875>] mutex_lock+0x10/0x1d
Jan 8 14:23:52 twister kernel: RSP <ffff8100b19a5d08>
Jan 8 14:23:52 twister kernel: CR2: 00000000000000b8
Jan 8 14:23:52 twister kernel: ---[ end trace d7aa1b498ca58476 ]---

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------

2008-01-09 09:16:51

by Tejun Heo

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

Hello,

My laptop and cell finally decided to talk to each other and I could
reproduce the bug here. The attached patch should remove the oops.

The bug is two folded. I just skimmed through the bluetooth code and am
very likely to wrong at places. Please correct me where I'm wrong.

1. It's introduced by class device migration and the bug will go away
with CONFIG_SYSFS_DEPRECATED turned on. With CONFIG_SYSFS_DEPRECATED
turned off, what used to be class devices live under actual devices.
For rfcomm, this means the rfcommN tty device now lives under the
aclXXXX node (probably represents an active connection?) instead of the
class directory.

It seems rfcommN devices are supposed to survive over disconnects so the
rfcommN device moves under the live connection while connection is alive
and retreats back to a default directory when the connection is lost.
This is all fine and dandy as long as the rfcommN device lives under
class directory as class directory never goes away.

However, with recent sysfs updates, rfcommN now lives directly under the
aclXXXX node. If the connection goes away while rfcommN device is under
it, it gets deleted but rfcommN is still treated as alive.

This isn't supported. sysfs doesn't allow parents to die first and the
dangling children to be salvaged using sysfs_move().

2. Which in turn exposes three bugs in sysfs
- sysfs_lookup() returning NULL on negative lookup. sysfs
code requires that all negative dentries are shot down.
lookup should return -ENOENT instead of NULL.
- in move and rename, error handling is wrong. It ends up
passing ERR_PTR() values to dput().
- there was an extra dput() in sysfs_move_dir().

The attached patch fixes all sysfs bugs and removes the oops. However,
rfcommX moving is still broken. The rfcommX device won't be visible
from sysfs tree after the initial move failure and all following moves
will fail.

Please confirm the attached patch fixes the oops. I'll separate it into
two patches and forward them to Greg. But bluetooth code also needs to
be updated such that it moves the refcommX device before killing the
connection node.

Thanks.

--
tejun


Attachments:
sysfs-fixes.patch (1.52 kB)

2008-01-09 15:58:31

by Cornelia Huck

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Wed, 09 Jan 2008 18:16:02 +0900,
Tejun Heo <[email protected]> wrote:

> This isn't supported. sysfs doesn't allow parents to die first and the
> dangling children to be salvaged using sysfs_move().

But (with the sysfs bugs fixed) it will return an error, won't it? It
seems the bluetooth code is rather optimistic with never checking
device_move()'s return code.

>
> 2. Which in turn exposes three bugs in sysfs
> - sysfs_lookup() returning NULL on negative lookup. sysfs
> code requires that all negative dentries are shot down.
> lookup should return -ENOENT instead of NULL.
> - in move and rename, error handling is wrong. It ends up
> passing ERR_PTR() values to dput().
> - there was an extra dput() in sysfs_move_dir().

Ups, seem to have copied that from some old version of
sysfs_rename_dir()...

>
> The attached patch fixes all sysfs bugs and removes the oops.

Looks OK at first glance.

2008-01-10 01:08:12

by Dave Young

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Wed, Jan 09, 2008 at 06:16:02PM +0900, Tejun Heo wrote:
> Hello,
>
> My laptop and cell finally decided to talk to each other and I could
> reproduce the bug here. The attached patch should remove the oops.
>
> The bug is two folded. I just skimmed through the bluetooth code and am
> very likely to wrong at places. Please correct me where I'm wrong.
>
> 1. It's introduced by class device migration and the bug will go away
> with CONFIG_SYSFS_DEPRECATED turned on. With CONFIG_SYSFS_DEPRECATED
> turned off, what used to be class devices live under actual devices.
> For rfcomm, this means the rfcommN tty device now lives under the
> aclXXXX node (probably represents an active connection?) instead of the
> class directory.
>
> It seems rfcommN devices are supposed to survive over disconnects so the
> rfcommN device moves under the live connection while connection is alive
> and retreats back to a default directory when the connection is lost.
> This is all fine and dandy as long as the rfcommN device lives under
> class directory as class directory never goes away.
>
> However, with recent sysfs updates, rfcommN now lives directly under the
> aclXXXX node. If the connection goes away while rfcommN device is under
> it, it gets deleted but rfcommN is still treated as alive.
>
> This isn't supported. sysfs doesn't allow parents to die first and the
> dangling children to be salvaged using sysfs_move().
>
> 2. Which in turn exposes three bugs in sysfs
> - sysfs_lookup() returning NULL on negative lookup. sysfs
> code requires that all negative dentries are shot down.
> lookup should return -ENOENT instead of NULL.
> - in move and rename, error handling is wrong. It ends up
> passing ERR_PTR() values to dput().
> - there was an extra dput() in sysfs_move_dir().
>
> The attached patch fixes all sysfs bugs and removes the oops. However,
> rfcommX moving is still broken. The rfcommX device won't be visible
> from sysfs tree after the initial move failure and all following moves
> will fail.
>
> Please confirm the attached patch fixes the oops. I'll separate it into
> two patches and forward them to Greg. But bluetooth code also needs to
> be updated such that it moves the refcommX device before killing the
> connection node.
For bluetooth device_move, the only child device of hci_conn dev is the rfcomm tty_dev. How about the following patch, please verify :

---
net/bluetooth/hci_sysfs.c | 7 +++++++
1 file changed, 7 insertions(+)

diff -upr linux/net/bluetooth/hci_sysfs.c linux.new/net/bluetooth/hci_sysfs.c
--- linux/net/bluetooth/hci_sysfs.c 2008-01-10 09:01:51.000000000 +0800
+++ linux.new/net/bluetooth/hci_sysfs.c 2008-01-10 09:01:22.000000000 +0800
@@ -316,9 +316,16 @@ void hci_conn_add_sysfs(struct hci_conn
schedule_work(&conn->work);
}

+static int hci_conn_move_child(struct device *dev, void *data)
+{
+ device_move(dev, NULL);
+ return 0;
+}
+
static void del_conn(struct work_struct *work)
{
struct hci_conn *conn = container_of(work, struct hci_conn, work);
+ device_for_each_child(&conn->dev, NULL, hci_conn_move_child);
device_del(&conn->dev);
put_device(&conn->dev);
}

>
> Thanks.
>
> --
> tejun

> diff --git a/drivers/ata/libata-pmp.c b/drivers/ata/libata-pmp.c
> diff --git a/fs/dcache.c b/fs/dcache.c
> diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
> index 3371629..f281cc6 100644
> --- a/fs/sysfs/dir.c
> +++ b/fs/sysfs/dir.c
> @@ -678,8 +678,10 @@ static struct dentry * sysfs_lookup(struct inode *dir, struct dentry *dentry,
> sd = sysfs_find_dirent(parent_sd, dentry->d_name.name);
>
> /* no such entry */
> - if (!sd)
> + if (!sd) {
> + ret = ERR_PTR(-ENOENT);
> goto out_unlock;
> + }
>
> /* attach dentry and inode */
> inode = sysfs_get_inode(sd);
> @@ -781,6 +783,7 @@ int sysfs_rename_dir(struct kobject * kobj, const char *new_name)
> old_dentry = sysfs_get_dentry(sd);
> if (IS_ERR(old_dentry)) {
> error = PTR_ERR(old_dentry);
> + old_dentry = NULL;
> goto out;
> }
>
> @@ -848,6 +851,7 @@ int sysfs_move_dir(struct kobject *kobj, struct kobject *new_parent_kobj)
> old_dentry = sysfs_get_dentry(sd);
> if (IS_ERR(old_dentry)) {
> error = PTR_ERR(old_dentry);
> + old_dentry = NULL;
> goto out;
> }
> old_parent = old_dentry->d_parent;
> @@ -855,6 +859,7 @@ int sysfs_move_dir(struct kobject *kobj, struct kobject *new_parent_kobj)
> new_parent = sysfs_get_dentry(new_parent_sd);
> if (IS_ERR(new_parent)) {
> error = PTR_ERR(new_parent);
> + new_parent = NULL;
> goto out;
> }
>
> @@ -878,7 +883,6 @@ again:
> error = 0;
> d_add(new_dentry, NULL);
> d_move(old_dentry, new_dentry);
> - dput(new_dentry);
>
> /* Remove from old parent's list and insert into new parent's list. */
> sysfs_unlink_sibling(sd);
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h

2008-01-10 10:15:43

by Gabor Gombas

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

Hi,

On Wed, Jan 09, 2008 at 06:16:02PM +0900, Tejun Heo wrote:

> Please confirm the attached patch fixes the oops. I'll separate it into
> two patches and forward them to Greg. But bluetooth code also needs to
> be updated such that it moves the refcommX device before killing the
> connection node.

The kernel survives switching the phone off & on but then oopses when
pppd is started:

Jan 10 10:59:09 twister kernel: Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP:
Jan 10 10:59:09 twister kernel: [<ffffffff802af03f>] sysfs_get_dentry+0x27/0x8f
Jan 10 10:59:09 twister kernel: PGD 260bb067 PUD 260bc067 PMD 0
Jan 10 10:59:09 twister kernel: Oops: 0000 [1]
Jan 10 10:59:09 twister kernel: CPU 0
Jan 10 10:59:09 twister kernel: Modules linked in: ppp_generic slhc binfmt_misc rfcomm l2cap nfsd auth_rpcgss exportfs ipt_REJECT xt_tcpudp ipt_LOG xt_limit iptable_filter ip_tables x_tables nfs lockd nfs_acl sunrpc fuse dm_crypt saa7134_alsa radeon hwmon_vid eeprom hci_usb bluetooth usb_storage tuner tea5767 tda8290 tuner_simple mt20xx tea5761 snd_intel8x0 snd_ac97_codec saa7134 firewire_ohci firewire_core sg ac97_bus videobuf_dma_sg videobuf_core ir_kbd_i2c crc_itu_t ir_common snd_pcm parport_pc parport sky2 forcedeth sr_mod snd_timer snd_page_alloc ehci_hcd ohci_hcd floppy cdrom
Jan 10 10:59:09 twister kernel: Pid: 4490, comm: pppd Not tainted 2.6.24-rc6debug-dirty #8
Jan 10 10:59:09 twister kernel: RIP: 0010:[<ffffffff802af03f>] [<ffffffff802af03f>] sysfs_get_dentry+0x27/0x8f
Jan 10 10:59:09 twister kernel: RSP: 0018:ffff8100260dbc88 EFLAGS: 00010207
Jan 10 10:59:09 twister kernel: RAX: 0000000000000000 RBX: ffff8100251577d0 RCX: 0000000000000000
Jan 10 10:59:09 twister kernel: RDX: ffff8100260e3410 RSI: ffffffff80643d70 RDI: ffff8100bb32d898
Jan 10 10:59:09 twister kernel: RBP: 0000000000000000 R08: 6390dfaa7e3bd395 R09: ffffffff8050cee6
Jan 10 10:59:09 twister kernel: R10: ffff8100260dbcb8 R11: 0000003000000018 R12: ffff8100b6ef4b90
Jan 10 10:59:09 twister kernel: R13: 00000000fffffff4 R14: ffff81004d9d2600 R15: ffff8100b6eb227e
Jan 10 10:59:09 twister kernel: FS: 00002b5731903ac0(0000) GS:ffffffff805ba000(0000) knlGS:0000000000000000
Jan 10 10:59:09 twister kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 10 10:59:09 twister kernel: CR2: 0000000000000008 CR3: 00000000260b1000 CR4: 00000000000006e0
Jan 10 10:59:09 twister kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 10 10:59:09 twister kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 10 10:59:09 twister kernel: Process pppd (pid: 4490, threadinfo ffff8100260da000, task ffff81009550d060)
Jan 10 10:59:09 twister kernel: Stack: ffff8100260e3690 ffff8100b6ef4b90 ffff81003a5942a0 ffffffff802af272
Jan 10 10:59:09 twister kernel: ffff8100becca2d0 ffff8100becca2d0 ffff8100260df300 ffff81003a5942a0
Jan 10 10:59:09 twister kernel: 00000000fffffff4 ffff81004d9d2600 ffff8100b6eb227e ffffffff803007a5
Jan 10 10:59:09 twister kernel: Call Trace:
Jan 10 10:59:09 twister kernel: [<ffffffff802af272>] sysfs_move_dir+0x63/0x1fd
Jan 10 10:59:09 twister kernel: [<ffffffff803007a5>] kobject_move+0xba/0x110
Jan 10 10:59:09 twister kernel: [<ffffffff803698cc>] device_move+0x59/0x111
Jan 10 10:59:09 twister kernel: [<ffffffff8827d31e>] :rfcomm:rfcomm_tty_open+0x1c1/0x1db
Jan 10 10:59:09 twister kernel: [<ffffffff80224204>] default_wake_function+0x0/0xe
Jan 10 10:59:09 twister kernel: [<ffffffff80345a2d>] tty_open+0x1c5/0x2af
Jan 10 10:59:09 twister kernel: [<ffffffff80278590>] chrdev_open+0x113/0x132
Jan 10 10:59:09 twister kernel: [<ffffffff8027847d>] chrdev_open+0x0/0x132
Jan 10 10:59:09 twister kernel: [<ffffffff80274b4a>] __dentry_open+0xda/0x1aa
Jan 10 10:59:09 twister kernel: [<ffffffff80274cd1>] do_filp_open+0x2d/0x3d
Jan 10 10:59:09 twister kernel: [<ffffffff80473812>] mutex_lock+0xd/0x1d
Jan 10 10:59:09 twister kernel: [<ffffffff802749d6>] get_unused_fd_flags+0x6d/0x100
Jan 10 10:59:09 twister kernel: [<ffffffff80274d27>] do_sys_open+0x46/0xc3
Jan 10 10:59:09 twister kernel: [<ffffffff8020b5fc>] tracesys+0xdc/0xe1
Jan 10 10:59:09 twister kernel:
Jan 10 10:59:09 twister kernel:
Jan 10 10:59:09 twister kernel: Code: 48 8b 45 08 48 39 d0 74 05 48 89 c5 eb f2 48 8b 7b 08 48 81
Jan 10 10:59:09 twister kernel: RIP [<ffffffff802af03f>] sysfs_get_dentry+0x27/0x8f
Jan 10 10:59:09 twister kernel: RSP <ffff8100260dbc88>
Jan 10 10:59:09 twister kernel: CR2: 0000000000000008
Jan 10 10:59:09 twister kernel: ---[ end trace b33fe66800773e0c ]---

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------

2008-01-11 23:10:01

by Gabor Gombas

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Thu, Jan 10, 2008 at 09:11:17AM +0800, Dave Young wrote:

> For bluetooth device_move, the only child device of hci_conn dev is
> the rfcomm tty_dev. How about the following patch, please verify :

There is now no oops, instead the keyboard becomes almost completely
unresponsible when I switch off & on the phone. The mouse still works
(tested both with X and with the VGA console), but terminal input and VT
switching is dead. On the VGA console, even scrolling using Shift+PgUp
stops working.

Alt+SysRq+w works, trace is below. Ctrl+Alt+Del also works, but then
init complains about hung processes and that it can not umount the file
systems and cannot stop the RAID arrays. Once it still rebooted though,
the second time it got hung after trying to umount the filesystems, and
I had to use Alt+SysRq+b.

If I can choose then I prefer the Oops...

Jan 11 23:46:35 twister kernel: SysRq : Emergency Sync
Jan 11 23:46:35 twister kernel: Emergency Sync complete
Jan 11 23:46:42 twister kernel: SysRq : Show Blocked State
Jan 11 23:46:42 twister kernel: task PC stack pid father
Jan 11 23:46:42 twister kernel: events/0 D ffffffff80487190 0 5 2
Jan 11 23:46:42 twister kernel: ffff8100bf84fd40 0000000000000046 ffff8100ae7f2600 ffff8100bf84fd00
Jan 11 23:46:42 twister kernel: ffff8100bf84a830 ffff8100bd872000 ffffffff80652ae0 0000000100000100
Jan 11 23:46:42 twister kernel: 0000000000000000 7fffffffffffffff 7fffffffffffffff 0000000000000002
Jan 11 23:46:42 twister kernel: Call Trace:
Jan 11 23:46:42 twister kernel: [<ffffffff80473543>] schedule_timeout+0x1e/0xad
Jan 11 23:46:42 twister kernel: [<ffffffff8028560d>] dput+0x26/0x103
Jan 11 23:46:42 twister kernel: [<ffffffff802af3fd>] sysfs_move_dir+0x1ee/0x1fd
Jan 11 23:46:42 twister kernel: [<ffffffff8047340b>] wait_for_common+0xc4/0x129
Jan 11 23:46:42 twister kernel: [<ffffffff80224204>] default_wake_function+0x0/0xe
Jan 11 23:46:42 twister kernel: [<ffffffff80471d72>] klist_del+0x15/0x2e
Jan 11 23:46:42 twister kernel: [<ffffffff803698f3>] device_move+0x80/0x111
Jan 11 23:46:42 twister kernel: [<ffffffff88146a27>] :bluetooth:hci_conn_move_child+0x0/0xf
Jan 11 23:46:42 twister kernel: [<ffffffff88146a32>] :bluetooth:hci_conn_move_child+0xb/0xf
Jan 11 23:46:42 twister kernel: [<ffffffff80369836>] device_for_each_child+0x22/0x4d
Jan 11 23:46:42 twister kernel: [<ffffffff88146a05>] :bluetooth:del_conn+0x0/0x22
Jan 11 23:46:42 twister kernel: [<ffffffff88146a1e>] :bluetooth:del_conn+0x19/0x22
Jan 11 23:46:42 twister kernel: [<ffffffff8023696d>] run_workqueue+0x74/0xee
Jan 11 23:46:42 twister kernel: [<ffffffff80236ff8>] worker_thread+0x0/0xe7
Jan 11 23:46:42 twister kernel: [<ffffffff802370d2>] worker_thread+0xda/0xe7
Jan 11 23:46:42 twister kernel: [<ffffffff80239b77>] autoremove_wake_function+0x0/0x2e
Jan 11 23:46:42 twister kernel: [<ffffffff80236ff8>] worker_thread+0x0/0xe7
Jan 11 23:46:42 twister kernel: [<ffffffff802399a0>] kthread+0x47/0x73
Jan 11 23:46:42 twister kernel: [<ffffffff8020bdc8>] child_rip+0xa/0x12
Jan 11 23:46:42 twister kernel: [<ffffffff80239959>] kthread+0x0/0x73
Jan 11 23:46:42 twister kernel: [<ffffffff8020bdbe>] child_rip+0x0/0x12
Jan 11 23:46:42 twister kernel:
Jan 11 23:46:42 twister kernel: cat D 7fffffffffffffff 0 3978 3965
Jan 11 23:46:42 twister kernel: ffff81009e857ce8 0000000000000086 ffff8100ae7f2900 ffff81009e857ca8
Jan 11 23:46:42 twister kernel: ffff8100bd872000 ffff8100bf9e7060 ffffffff80652ae0 00000001000000c0
Jan 11 23:46:42 twister kernel: 0000000000000000 7fffffffffffffff 7fffffffffffffff 0000000000000002
Jan 11 23:46:42 twister kernel: Call Trace:
Jan 11 23:46:42 twister kernel: [<ffffffff80473543>] schedule_timeout+0x1e/0xad
Jan 11 23:46:42 twister kernel: [<ffffffff80223ccc>] __dequeue_entity+0x1c/0x32
Jan 11 23:46:42 twister kernel: [<ffffffff80223d05>] set_next_entity+0x23/0x73
Jan 11 23:46:42 twister kernel: [<ffffffff8047340b>] wait_for_common+0xc4/0x129
Jan 11 23:46:42 twister kernel: [<ffffffff80224204>] default_wake_function+0x0/0xe
Jan 11 23:46:42 twister kernel: [<ffffffff80236b02>] flush_cpu_workqueue+0x50/0x58
Jan 11 23:46:42 twister kernel: [<ffffffff80236c32>] wq_barrier_func+0x0/0x9
Jan 11 23:46:42 twister kernel: [<ffffffff80236ecf>] flush_workqueue+0x9/0x12
Jan 11 23:46:42 twister kernel: [<ffffffff803449f9>] release_dev+0x47c/0x5e2
Jan 11 23:46:42 twister kernel: [<ffffffff8021b609>] do_page_fault+0x2ff/0x65a
Jan 11 23:46:42 twister kernel: [<ffffffff80344b6b>] tty_release+0xc/0x10
Jan 11 23:46:42 twister kernel: [<ffffffff80277013>] __fput+0xb1/0x16f
Jan 11 23:46:42 twister kernel: [<ffffffff80274961>] filp_close+0x5d/0x65
Jan 11 23:46:42 twister kernel: [<ffffffff80275ab3>] sys_close+0x73/0xa6
Jan 11 23:46:42 twister kernel: [<ffffffff8020b5fc>] tracesys+0xdc/0xe1
Jan 11 23:46:42 twister kernel:

If I'm not mistaken the "cat" above is the "cat /dev/zero >
/dev/rfcomm0" command.

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------

2008-01-14 07:05:31

by Dave Young

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Jan 12, 2008 7:09 AM, Gabor Gombas <[email protected]> wrote:
> On Thu, Jan 10, 2008 at 09:11:17AM +0800, Dave Young wrote:
>
> > For bluetooth device_move, the only child device of hci_conn dev is
> > the rfcomm tty_dev. How about the following patch, please verify :
>
> There is now no oops, instead the keyboard becomes almost completely
> unresponsible when I switch off & on the phone. The mouse still works
> (tested both with X and with the VGA console), but terminal input and VT
> switching is dead. On the VGA console, even scrolling using Shift+PgUp
> stops working.
>
> Alt+SysRq+w works, trace is below. Ctrl+Alt+Del also works, but then
> init complains about hung processes and that it can not umount the file
> systems and cannot stop the RAID arrays. Once it still rebooted though,
> the second time it got hung after trying to umount the filesystems, and
> I had to use Alt+SysRq+b.
>
> If I can choose then I prefer the Oops...
>
> Jan 11 23:46:35 twister kernel: SysRq : Emergency Sync
> Jan 11 23:46:35 twister kernel: Emergency Sync complete
> Jan 11 23:46:42 twister kernel: SysRq : Show Blocked State
> Jan 11 23:46:42 twister kernel: task PC stack pid father
> Jan 11 23:46:42 twister kernel: events/0 D ffffffff80487190 0 5 2
> Jan 11 23:46:42 twister kernel: ffff8100bf84fd40 0000000000000046 ffff8100ae7f2600 ffff8100bf84fd00
> Jan 11 23:46:42 twister kernel: ffff8100bf84a830 ffff8100bd872000 ffffffff80652ae0 0000000100000100
> Jan 11 23:46:42 twister kernel: 0000000000000000 7fffffffffffffff 7fffffffffffffff 0000000000000002
> Jan 11 23:46:42 twister kernel: Call Trace:
> Jan 11 23:46:42 twister kernel: [<ffffffff80473543>] schedule_timeout+0x1e/0xad
> Jan 11 23:46:42 twister kernel: [<ffffffff8028560d>] dput+0x26/0x103
> Jan 11 23:46:42 twister kernel: [<ffffffff802af3fd>] sysfs_move_dir+0x1ee/0x1fd
> Jan 11 23:46:42 twister kernel: [<ffffffff8047340b>] wait_for_common+0xc4/0x129
> Jan 11 23:46:42 twister kernel: [<ffffffff80224204>] default_wake_function+0x0/0xe
> Jan 11 23:46:42 twister kernel: [<ffffffff80471d72>] klist_del+0x15/0x2e
> Jan 11 23:46:42 twister kernel: [<ffffffff803698f3>] device_move+0x80/0x111
> Jan 11 23:46:42 twister kernel: [<ffffffff88146a27>] :bluetooth:hci_conn_move_child+0x0/0xf
> Jan 11 23:46:42 twister kernel: [<ffffffff88146a32>] :bluetooth:hci_conn_move_child+0xb/0xf
> Jan 11 23:46:42 twister kernel: [<ffffffff80369836>] device_for_each_child+0x22/0x4d
> Jan 11 23:46:42 twister kernel: [<ffffffff88146a05>] :bluetooth:del_conn+0x0/0x22
> Jan 11 23:46:42 twister kernel: [<ffffffff88146a1e>] :bluetooth:del_conn+0x19/0x22
> Jan 11 23:46:42 twister kernel: [<ffffffff8023696d>] run_workqueue+0x74/0xee
> Jan 11 23:46:42 twister kernel: [<ffffffff80236ff8>] worker_thread+0x0/0xe7
> Jan 11 23:46:42 twister kernel: [<ffffffff802370d2>] worker_thread+0xda/0xe7
> Jan 11 23:46:42 twister kernel: [<ffffffff80239b77>] autoremove_wake_function+0x0/0x2e
> Jan 11 23:46:42 twister kernel: [<ffffffff80236ff8>] worker_thread+0x0/0xe7
> Jan 11 23:46:42 twister kernel: [<ffffffff802399a0>] kthread+0x47/0x73
> Jan 11 23:46:42 twister kernel: [<ffffffff8020bdc8>] child_rip+0xa/0x12
> Jan 11 23:46:42 twister kernel: [<ffffffff80239959>] kthread+0x0/0x73
> Jan 11 23:46:42 twister kernel: [<ffffffff8020bdbe>] child_rip+0x0/0x12
> Jan 11 23:46:42 twister kernel:
> Jan 11 23:46:42 twister kernel: cat D 7fffffffffffffff 0 3978 3965
> Jan 11 23:46:42 twister kernel: ffff81009e857ce8 0000000000000086 ffff8100ae7f2900 ffff81009e857ca8
> Jan 11 23:46:42 twister kernel: ffff8100bd872000 ffff8100bf9e7060 ffffffff80652ae0 00000001000000c0
> Jan 11 23:46:42 twister kernel: 0000000000000000 7fffffffffffffff 7fffffffffffffff 0000000000000002
> Jan 11 23:46:42 twister kernel: Call Trace:
> Jan 11 23:46:42 twister kernel: [<ffffffff80473543>] schedule_timeout+0x1e/0xad
> Jan 11 23:46:42 twister kernel: [<ffffffff80223ccc>] __dequeue_entity+0x1c/0x32
> Jan 11 23:46:42 twister kernel: [<ffffffff80223d05>] set_next_entity+0x23/0x73
> Jan 11 23:46:42 twister kernel: [<ffffffff8047340b>] wait_for_common+0xc4/0x129
> Jan 11 23:46:42 twister kernel: [<ffffffff80224204>] default_wake_function+0x0/0xe
> Jan 11 23:46:42 twister kernel: [<ffffffff80236b02>] flush_cpu_workqueue+0x50/0x58
> Jan 11 23:46:42 twister kernel: [<ffffffff80236c32>] wq_barrier_func+0x0/0x9
> Jan 11 23:46:42 twister kernel: [<ffffffff80236ecf>] flush_workqueue+0x9/0x12
> Jan 11 23:46:42 twister kernel: [<ffffffff803449f9>] release_dev+0x47c/0x5e2
> Jan 11 23:46:42 twister kernel: [<ffffffff8021b609>] do_page_fault+0x2ff/0x65a
> Jan 11 23:46:42 twister kernel: [<ffffffff80344b6b>] tty_release+0xc/0x10
> Jan 11 23:46:42 twister kernel: [<ffffffff80277013>] __fput+0xb1/0x16f
> Jan 11 23:46:42 twister kernel: [<ffffffff80274961>] filp_close+0x5d/0x65
> Jan 11 23:46:42 twister kernel: [<ffffffff80275ab3>] sys_close+0x73/0xa6
> Jan 11 23:46:42 twister kernel: [<ffffffff8020b5fc>] tracesys+0xdc/0xe1
> Jan 11 23:46:42 twister kernel:
>
> If I'm not mistaken the "cat" above is the "cat /dev/zero >
> /dev/rfcomm0" command.
>

Any idea about this?

Add marcel and davem in cc-list.

>
> Gabor
>
> --
> ---------------------------------------------------------
> MTA SZTAKI Computer and Automation Research Institute
> Hungarian Academy of Sciences
> ---------------------------------------------------------
>

2008-01-14 12:52:41

by Cornelia Huck

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Mon, 14 Jan 2008 15:05:19 +0800,
"Dave Young" <[email protected]> wrote:

> On Jan 12, 2008 7:09 AM, Gabor Gombas <[email protected]> wrote:
> > On Thu, Jan 10, 2008 at 09:11:17AM +0800, Dave Young wrote:
> >
> > > For bluetooth device_move, the only child device of hci_conn dev is
> > > the rfcomm tty_dev. How about the following patch, please verify :
> >
> > There is now no oops, instead the keyboard becomes almost completely
> > unresponsible when I switch off & on the phone. The mouse still works
> > (tested both with X and with the VGA console), but terminal input and VT
> > switching is dead. On the VGA console, even scrolling using Shift+PgUp
> > stops working.
> >
> > Alt+SysRq+w works, trace is below. Ctrl+Alt+Del also works, but then
> > init complains about hung processes and that it can not umount the file
> > systems and cannot stop the RAID arrays. Once it still rebooted though,
> > the second time it got hung after trying to umount the filesystems, and
> > I had to use Alt+SysRq+b.
> >
> > If I can choose then I prefer the Oops...
> >
> > Jan 11 23:46:35 twister kernel: SysRq : Emergency Sync
> > Jan 11 23:46:35 twister kernel: Emergency Sync complete
> > Jan 11 23:46:42 twister kernel: SysRq : Show Blocked State
> > Jan 11 23:46:42 twister kernel: task PC stack pid father
> > Jan 11 23:46:42 twister kernel: events/0 D ffffffff80487190 0 5 2
> > Jan 11 23:46:42 twister kernel: ffff8100bf84fd40 0000000000000046 ffff8100ae7f2600 ffff8100bf84fd00
> > Jan 11 23:46:42 twister kernel: ffff8100bf84a830 ffff8100bd872000 ffffffff80652ae0 0000000100000100
> > Jan 11 23:46:42 twister kernel: 0000000000000000 7fffffffffffffff 7fffffffffffffff 0000000000000002
> > Jan 11 23:46:42 twister kernel: Call Trace:
> > Jan 11 23:46:42 twister kernel: [<ffffffff80473543>] schedule_timeout+0x1e/0xad
> > Jan 11 23:46:42 twister kernel: [<ffffffff8028560d>] dput+0x26/0x103
> > Jan 11 23:46:42 twister kernel: [<ffffffff802af3fd>] sysfs_move_dir+0x1ee/0x1fd
> > Jan 11 23:46:42 twister kernel: [<ffffffff8047340b>] wait_for_common+0xc4/0x129
> > Jan 11 23:46:42 twister kernel: [<ffffffff80224204>] default_wake_function+0x0/0xe
> > Jan 11 23:46:42 twister kernel: [<ffffffff80471d72>] klist_del+0x15/0x2e
> > Jan 11 23:46:42 twister kernel: [<ffffffff803698f3>] device_move+0x80/0x111
> > Jan 11 23:46:42 twister kernel: [<ffffffff88146a27>] :bluetooth:hci_conn_move_child+0x0/0xf
> > Jan 11 23:46:42 twister kernel: [<ffffffff88146a32>] :bluetooth:hci_conn_move_child+0xb/0xf
> > Jan 11 23:46:42 twister kernel: [<ffffffff80369836>] device_for_each_child+0x22/0x4d

The problem here is that device_for_each_child() will elevate the
klist_node's refcount while the callback function calls device_move().
device_move() calls klist_remove() on the same node, which will do a
klist_del() and then wait for the refcount to drop to 0...

> > Jan 11 23:46:42 twister kernel: [<ffffffff88146a05>] :bluetooth:del_conn+0x0/0x22
> > Jan 11 23:46:42 twister kernel: [<ffffffff88146a1e>] :bluetooth:del_conn+0x19/0x22
> > Jan 11 23:46:42 twister kernel: [<ffffffff8023696d>] run_workqueue+0x74/0xee
> > Jan 11 23:46:42 twister kernel: [<ffffffff80236ff8>] worker_thread+0x0/0xe7
> > Jan 11 23:46:42 twister kernel: [<ffffffff802370d2>] worker_thread+0xda/0xe7
> > Jan 11 23:46:42 twister kernel: [<ffffffff80239b77>] autoremove_wake_function+0x0/0x2e
> > Jan 11 23:46:42 twister kernel: [<ffffffff80236ff8>] worker_thread+0x0/0xe7
> > Jan 11 23:46:42 twister kernel: [<ffffffff802399a0>] kthread+0x47/0x73
> > Jan 11 23:46:42 twister kernel: [<ffffffff8020bdc8>] child_rip+0xa/0x12
> > Jan 11 23:46:42 twister kernel: [<ffffffff80239959>] kthread+0x0/0x73
> > Jan 11 23:46:42 twister kernel: [<ffffffff8020bdbe>] child_rip+0x0/0x12
> > Jan 11 23:46:42 twister kernel:
> > Jan 11 23:46:42 twister kernel: cat D 7fffffffffffffff 0 3978 3965
> > Jan 11 23:46:42 twister kernel: ffff81009e857ce8 0000000000000086 ffff8100ae7f2900 ffff81009e857ca8
> > Jan 11 23:46:42 twister kernel: ffff8100bd872000 ffff8100bf9e7060 ffffffff80652ae0 00000001000000c0
> > Jan 11 23:46:42 twister kernel: 0000000000000000 7fffffffffffffff 7fffffffffffffff 0000000000000002
> > Jan 11 23:46:42 twister kernel: Call Trace:
> > Jan 11 23:46:42 twister kernel: [<ffffffff80473543>] schedule_timeout+0x1e/0xad
> > Jan 11 23:46:42 twister kernel: [<ffffffff80223ccc>] __dequeue_entity+0x1c/0x32
> > Jan 11 23:46:42 twister kernel: [<ffffffff80223d05>] set_next_entity+0x23/0x73
> > Jan 11 23:46:42 twister kernel: [<ffffffff8047340b>] wait_for_common+0xc4/0x129
> > Jan 11 23:46:42 twister kernel: [<ffffffff80224204>] default_wake_function+0x0/0xe
> > Jan 11 23:46:42 twister kernel: [<ffffffff80236b02>] flush_cpu_workqueue+0x50/0x58
> > Jan 11 23:46:42 twister kernel: [<ffffffff80236c32>] wq_barrier_func+0x0/0x9
> > Jan 11 23:46:42 twister kernel: [<ffffffff80236ecf>] flush_workqueue+0x9/0x12
> > Jan 11 23:46:42 twister kernel: [<ffffffff803449f9>] release_dev+0x47c/0x5e2
> > Jan 11 23:46:42 twister kernel: [<ffffffff8021b609>] do_page_fault+0x2ff/0x65a
> > Jan 11 23:46:42 twister kernel: [<ffffffff80344b6b>] tty_release+0xc/0x10
> > Jan 11 23:46:42 twister kernel: [<ffffffff80277013>] __fput+0xb1/0x16f
> > Jan 11 23:46:42 twister kernel: [<ffffffff80274961>] filp_close+0x5d/0x65
> > Jan 11 23:46:42 twister kernel: [<ffffffff80275ab3>] sys_close+0x73/0xa6
> > Jan 11 23:46:42 twister kernel: [<ffffffff8020b5fc>] tracesys+0xdc/0xe1
> > Jan 11 23:46:42 twister kernel:
> >
> > If I'm not mistaken the "cat" above is the "cat /dev/zero >
> > /dev/rfcomm0" command.
> >
>
> Any idea about this?
>
> Add marcel and davem in cc-list.

As a band-aid, you could use device_find_child() and then device_move()
on the found child. Repeat until no children are left.

2008-01-15 01:54:44

by Dave Young

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Mon, Jan 14, 2008 at 01:52:28PM +0100, Cornelia Huck wrote:
> On Mon, 14 Jan 2008 15:05:19 +0800,
> "Dave Young" <[email protected]> wrote:
>
> > On Jan 12, 2008 7:09 AM, Gabor Gombas <[email protected]> wrote:
> > > On Thu, Jan 10, 2008 at 09:11:17AM +0800, Dave Young wrote:
> > >
> > > > For bluetooth device_move, the only child device of hci_conn dev is
> > > > the rfcomm tty_dev. How about the following patch, please verify :
> > >
> > > There is now no oops, instead the keyboard becomes almost completely
> > > unresponsible when I switch off & on the phone. The mouse still works
> > > (tested both with X and with the VGA console), but terminal input and VT
> > > switching is dead. On the VGA console, even scrolling using Shift+PgUp
> > > stops working.
> > >
> > > Alt+SysRq+w works, trace is below. Ctrl+Alt+Del also works, but then
> > > init complains about hung processes and that it can not umount the file
> > > systems and cannot stop the RAID arrays. Once it still rebooted though,
> > > the second time it got hung after trying to umount the filesystems, and
> > > I had to use Alt+SysRq+b.
> > >
> > > If I can choose then I prefer the Oops...
> > >
> > > Jan 11 23:46:35 twister kernel: SysRq : Emergency Sync
> > > Jan 11 23:46:35 twister kernel: Emergency Sync complete
> > > Jan 11 23:46:42 twister kernel: SysRq : Show Blocked State
> > > Jan 11 23:46:42 twister kernel: task PC stack pid father
> > > Jan 11 23:46:42 twister kernel: events/0 D ffffffff80487190 0 5 2
> > > Jan 11 23:46:42 twister kernel: ffff8100bf84fd40 0000000000000046 ffff8100ae7f2600 ffff8100bf84fd00
> > > Jan 11 23:46:42 twister kernel: ffff8100bf84a830 ffff8100bd872000 ffffffff80652ae0 0000000100000100
> > > Jan 11 23:46:42 twister kernel: 0000000000000000 7fffffffffffffff 7fffffffffffffff 0000000000000002
> > > Jan 11 23:46:42 twister kernel: Call Trace:
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80473543>] schedule_timeout+0x1e/0xad
> > > Jan 11 23:46:42 twister kernel: [<ffffffff8028560d>] dput+0x26/0x103
> > > Jan 11 23:46:42 twister kernel: [<ffffffff802af3fd>] sysfs_move_dir+0x1ee/0x1fd
> > > Jan 11 23:46:42 twister kernel: [<ffffffff8047340b>] wait_for_common+0xc4/0x129
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80224204>] default_wake_function+0x0/0xe
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80471d72>] klist_del+0x15/0x2e
> > > Jan 11 23:46:42 twister kernel: [<ffffffff803698f3>] device_move+0x80/0x111
> > > Jan 11 23:46:42 twister kernel: [<ffffffff88146a27>] :bluetooth:hci_conn_move_child+0x0/0xf
> > > Jan 11 23:46:42 twister kernel: [<ffffffff88146a32>] :bluetooth:hci_conn_move_child+0xb/0xf
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80369836>] device_for_each_child+0x22/0x4d
>
> The problem here is that device_for_each_child() will elevate the
> klist_node's refcount while the callback function calls device_move().
> device_move() calls klist_remove() on the same node, which will do a
> klist_del() and then wait for the refcount to drop to 0...

My wrong, it is the problem, thanks.

>
> > > Jan 11 23:46:42 twister kernel: [<ffffffff88146a05>] :bluetooth:del_conn+0x0/0x22
> > > Jan 11 23:46:42 twister kernel: [<ffffffff88146a1e>] :bluetooth:del_conn+0x19/0x22
> > > Jan 11 23:46:42 twister kernel: [<ffffffff8023696d>] run_workqueue+0x74/0xee
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80236ff8>] worker_thread+0x0/0xe7
> > > Jan 11 23:46:42 twister kernel: [<ffffffff802370d2>] worker_thread+0xda/0xe7
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80239b77>] autoremove_wake_function+0x0/0x2e
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80236ff8>] worker_thread+0x0/0xe7
> > > Jan 11 23:46:42 twister kernel: [<ffffffff802399a0>] kthread+0x47/0x73
> > > Jan 11 23:46:42 twister kernel: [<ffffffff8020bdc8>] child_rip+0xa/0x12
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80239959>] kthread+0x0/0x73
> > > Jan 11 23:46:42 twister kernel: [<ffffffff8020bdbe>] child_rip+0x0/0x12
> > > Jan 11 23:46:42 twister kernel:
> > > Jan 11 23:46:42 twister kernel: cat D 7fffffffffffffff 0 3978 3965
> > > Jan 11 23:46:42 twister kernel: ffff81009e857ce8 0000000000000086 ffff8100ae7f2900 ffff81009e857ca8
> > > Jan 11 23:46:42 twister kernel: ffff8100bd872000 ffff8100bf9e7060 ffffffff80652ae0 00000001000000c0
> > > Jan 11 23:46:42 twister kernel: 0000000000000000 7fffffffffffffff 7fffffffffffffff 0000000000000002
> > > Jan 11 23:46:42 twister kernel: Call Trace:
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80473543>] schedule_timeout+0x1e/0xad
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80223ccc>] __dequeue_entity+0x1c/0x32
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80223d05>] set_next_entity+0x23/0x73
> > > Jan 11 23:46:42 twister kernel: [<ffffffff8047340b>] wait_for_common+0xc4/0x129
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80224204>] default_wake_function+0x0/0xe
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80236b02>] flush_cpu_workqueue+0x50/0x58
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80236c32>] wq_barrier_func+0x0/0x9
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80236ecf>] flush_workqueue+0x9/0x12
> > > Jan 11 23:46:42 twister kernel: [<ffffffff803449f9>] release_dev+0x47c/0x5e2
> > > Jan 11 23:46:42 twister kernel: [<ffffffff8021b609>] do_page_fault+0x2ff/0x65a
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80344b6b>] tty_release+0xc/0x10
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80277013>] __fput+0xb1/0x16f
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80274961>] filp_close+0x5d/0x65
> > > Jan 11 23:46:42 twister kernel: [<ffffffff80275ab3>] sys_close+0x73/0xa6
> > > Jan 11 23:46:42 twister kernel: [<ffffffff8020b5fc>] tracesys+0xdc/0xe1
> > > Jan 11 23:46:42 twister kernel:
> > >
> > > If I'm not mistaken the "cat" above is the "cat /dev/zero >
> > > /dev/rfcomm0" command.
> > >
> >
> > Any idea about this?
> >
> > Add marcel and davem in cc-list.
>
> As a band-aid, you could use device_find_child() and then device_move()
> on the found child. Repeat until no children are left.

This one should fix the bug, Gabor, could you test?
---
The rfcomm tty device will possibly retain even when conn is down,
and sysfs doesn't support zombie device moving, so this patch
move the tty device before conn device is destroyed.

Signed-off-by: Dave Young <[email protected]>

---
net/bluetooth/hci_sysfs.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff -upr linux/net/bluetooth/hci_sysfs.c linux.new/net/bluetooth/hci_sysfs.c
--- linux/net/bluetooth/hci_sysfs.c 2008-01-15 09:36:47.000000000 +0800
+++ linux.new/net/bluetooth/hci_sysfs.c 2008-01-15 09:44:47.000000000 +0800
@@ -316,9 +316,24 @@ void hci_conn_add_sysfs(struct hci_conn
schedule_work(&conn->work);
}

+static int __match_tty(struct device *dev, void *data)
+{
+ /* The rfcomm tty device will possibly retain even when conn
+ * is down, and sysfs doesn't support move zombie device,
+ * so we should move the device before conn device is destroyed.
+ * Due to the only child device of hci_conn dev is rfcomm
+ * tty_dev, here just return 1
+ */
+ return 1;
+}
+
static void del_conn(struct work_struct *work)
{
+ struct device *dev;
struct hci_conn *conn = container_of(work, struct hci_conn, work);
+
+ while (dev = device_find_child(&conn->dev, NULL, __match_tty))
+ device_move(dev, NULL);
device_del(&conn->dev);
put_device(&conn->dev);
}

2008-01-16 00:59:17

by Dave Young

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Tue, Jan 15, 2008 at 09:57:41AM +0800, Dave Young wrote:
> On Mon, Jan 14, 2008 at 01:52:28PM +0100, Cornelia Huck wrote:
> > On Mon, 14 Jan 2008 15:05:19 +0800,
> > "Dave Young" <[email protected]> wrote:
> >
> > > On Jan 12, 2008 7:09 AM, Gabor Gombas <[email protected]> wrote:
> > > > On Thu, Jan 10, 2008 at 09:11:17AM +0800, Dave Young wrote:
> > > >
> > > > > For bluetooth device_move, the only child device of hci_conn dev is
> > > > > the rfcomm tty_dev. How about the following patch, please verify :
> > > >
> > > > There is now no oops, instead the keyboard becomes almost completely
> > > > unresponsible when I switch off & on the phone. The mouse still works
> > > > (tested both with X and with the VGA console), but terminal input and VT
> > > > switching is dead. On the VGA console, even scrolling using Shift+PgUp
> > > > stops working.
> > > >
> > > > Alt+SysRq+w works, trace is below. Ctrl+Alt+Del also works, but then
> > > > init complains about hung processes and that it can not umount the file
> > > > systems and cannot stop the RAID arrays. Once it still rebooted though,
> > > > the second time it got hung after trying to umount the filesystems, and
> > > > I had to use Alt+SysRq+b.
> > > >
> > > > If I can choose then I prefer the Oops...
> > > >
> > > > Jan 11 23:46:35 twister kernel: SysRq : Emergency Sync
> > > > Jan 11 23:46:35 twister kernel: Emergency Sync complete
> > > > Jan 11 23:46:42 twister kernel: SysRq : Show Blocked State
> > > > Jan 11 23:46:42 twister kernel: task PC stack pid father
> > > > Jan 11 23:46:42 twister kernel: events/0 D ffffffff80487190 0 5 2
> > > > Jan 11 23:46:42 twister kernel: ffff8100bf84fd40 0000000000000046 ffff8100ae7f2600 ffff8100bf84fd00
> > > > Jan 11 23:46:42 twister kernel: ffff8100bf84a830 ffff8100bd872000 ffffffff80652ae0 0000000100000100
> > > > Jan 11 23:46:42 twister kernel: 0000000000000000 7fffffffffffffff 7fffffffffffffff 0000000000000002
> > > > Jan 11 23:46:42 twister kernel: Call Trace:
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80473543>] schedule_timeout+0x1e/0xad
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff8028560d>] dput+0x26/0x103
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff802af3fd>] sysfs_move_dir+0x1ee/0x1fd
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff8047340b>] wait_for_common+0xc4/0x129
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80224204>] default_wake_function+0x0/0xe
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80471d72>] klist_del+0x15/0x2e
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff803698f3>] device_move+0x80/0x111
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff88146a27>] :bluetooth:hci_conn_move_child+0x0/0xf
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff88146a32>] :bluetooth:hci_conn_move_child+0xb/0xf
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80369836>] device_for_each_child+0x22/0x4d
> >
> > The problem here is that device_for_each_child() will elevate the
> > klist_node's refcount while the callback function calls device_move().
> > device_move() calls klist_remove() on the same node, which will do a
> > klist_del() and then wait for the refcount to drop to 0...
>
> My wrong, it is the problem, thanks.
>
> >
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff88146a05>] :bluetooth:del_conn+0x0/0x22
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff88146a1e>] :bluetooth:del_conn+0x19/0x22
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff8023696d>] run_workqueue+0x74/0xee
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80236ff8>] worker_thread+0x0/0xe7
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff802370d2>] worker_thread+0xda/0xe7
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80239b77>] autoremove_wake_function+0x0/0x2e
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80236ff8>] worker_thread+0x0/0xe7
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff802399a0>] kthread+0x47/0x73
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff8020bdc8>] child_rip+0xa/0x12
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80239959>] kthread+0x0/0x73
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff8020bdbe>] child_rip+0x0/0x12
> > > > Jan 11 23:46:42 twister kernel:
> > > > Jan 11 23:46:42 twister kernel: cat D 7fffffffffffffff 0 3978 3965
> > > > Jan 11 23:46:42 twister kernel: ffff81009e857ce8 0000000000000086 ffff8100ae7f2900 ffff81009e857ca8
> > > > Jan 11 23:46:42 twister kernel: ffff8100bd872000 ffff8100bf9e7060 ffffffff80652ae0 00000001000000c0
> > > > Jan 11 23:46:42 twister kernel: 0000000000000000 7fffffffffffffff 7fffffffffffffff 0000000000000002
> > > > Jan 11 23:46:42 twister kernel: Call Trace:
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80473543>] schedule_timeout+0x1e/0xad
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80223ccc>] __dequeue_entity+0x1c/0x32
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80223d05>] set_next_entity+0x23/0x73
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff8047340b>] wait_for_common+0xc4/0x129
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80224204>] default_wake_function+0x0/0xe
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80236b02>] flush_cpu_workqueue+0x50/0x58
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80236c32>] wq_barrier_func+0x0/0x9
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80236ecf>] flush_workqueue+0x9/0x12
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff803449f9>] release_dev+0x47c/0x5e2
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff8021b609>] do_page_fault+0x2ff/0x65a
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80344b6b>] tty_release+0xc/0x10
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80277013>] __fput+0xb1/0x16f
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80274961>] filp_close+0x5d/0x65
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff80275ab3>] sys_close+0x73/0xa6
> > > > Jan 11 23:46:42 twister kernel: [<ffffffff8020b5fc>] tracesys+0xdc/0xe1
> > > > Jan 11 23:46:42 twister kernel:
> > > >
> > > > If I'm not mistaken the "cat" above is the "cat /dev/zero >
> > > > /dev/rfcomm0" command.
> > > >
> > >
> > > Any idea about this?
> > >
> > > Add marcel and davem in cc-list.
> >
> > As a band-aid, you could use device_find_child() and then device_move()
> > on the found child. Repeat until no children are left.
>
> This one should fix the bug, Gabor, could you test?
> ---
> The rfcomm tty device will possibly retain even when conn is down,
> and sysfs doesn't support zombie device moving, so this patch
> move the tty device before conn device is destroyed.
>
> Signed-off-by: Dave Young <[email protected]>
>
> ---
> net/bluetooth/hci_sysfs.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff -upr linux/net/bluetooth/hci_sysfs.c linux.new/net/bluetooth/hci_sysfs.c
> --- linux/net/bluetooth/hci_sysfs.c 2008-01-15 09:36:47.000000000 +0800
> +++ linux.new/net/bluetooth/hci_sysfs.c 2008-01-15 09:44:47.000000000 +0800
> @@ -316,9 +316,24 @@ void hci_conn_add_sysfs(struct hci_conn
> schedule_work(&conn->work);
> }
>
> +static int __match_tty(struct device *dev, void *data)
> +{
> + /* The rfcomm tty device will possibly retain even when conn
> + * is down, and sysfs doesn't support move zombie device,
> + * so we should move the device before conn device is destroyed.
> + * Due to the only child device of hci_conn dev is rfcomm
> + * tty_dev, here just return 1
> + */
> + return 1;
> +}
> +
> static void del_conn(struct work_struct *work)
> {
> + struct device *dev;
> struct hci_conn *conn = container_of(work, struct hci_conn, work);
> +
> + while (dev = device_find_child(&conn->dev, NULL, __match_tty))
> + device_move(dev, NULL);
> device_del(&conn->dev);
> put_device(&conn->dev);
> }

Sorry, should be this:

The rfcomm tty device will possibly retain even when conn is down,
and sysfs doesn't support zombie device moving, so this patch
move the tty device before conn device is destroyed.

Signed-off-by: Dave Young <[email protected]>

---
diff -upr linux/net/bluetooth/hci_sysfs.c linux.new/net/bluetooth/hci_sysfs.c
--- linux/net/bluetooth/hci_sysfs.c 2008-01-15 09:36:47.000000000 +0800
+++ linux.new/net/bluetooth/hci_sysfs.c 2008-01-16 08:58:03.000000000 +0800
@@ -316,9 +316,26 @@ void hci_conn_add_sysfs(struct hci_conn
schedule_work(&conn->work);
}

+static int __match_tty(struct device *dev, void *data)
+{
+ /* The rfcomm tty device will possibly retain even when conn
+ * is down, and sysfs doesn't support move zombie device,
+ * so we should move the device before conn device is destroyed.
+ * Due to the only child device of hci_conn dev is rfcomm
+ * tty_dev, here just return 1
+ */
+ return 1;
+}
+
static void del_conn(struct work_struct *work)
{
+ struct device *dev;
struct hci_conn *conn = container_of(work, struct hci_conn, work);
+
+ while(dev = device_find_child(&conn->dev, NULL, __match_tty)) {
+ device_move(dev, NULL);
+ put_device(dev);
+ }
device_del(&conn->dev);
put_device(&conn->dev);
}

2008-01-16 23:06:59

by Gabor Gombas

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

Hi,

On Wed, Jan 16, 2008 at 09:02:05AM +0800, Dave Young wrote:

> The rfcomm tty device will possibly retain even when conn is down,
> and sysfs doesn't support zombie device moving, so this patch
> move the tty device before conn device is destroyed.
>
> Signed-off-by: Dave Young <[email protected]>

This seems to work, both the oops and the hang are gone. I get these
messages in syslog when the Bluetooth link hangs and I want to kill pppd
with "poff":

Jan 16 23:55:59 twister kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1
Jan 16 23:56:09 twister kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1

But a "killall -9 pppd" seems to help and then the re-connect (after the
phone got power-cycled) works.

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------

2008-01-17 07:25:00

by Dave Young

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Jan 17, 2008 7:06 AM, Gabor Gombas <[email protected]> wrote:
> Hi,
>
> On Wed, Jan 16, 2008 at 09:02:05AM +0800, Dave Young wrote:
>
> > The rfcomm tty device will possibly retain even when conn is down,
> > and sysfs doesn't support zombie device moving, so this patch
> > move the tty device before conn device is destroyed.
> >
> > Signed-off-by: Dave Young <[email protected]>
>
> This seems to work, both the oops and the hang are gone. I get these
> messages in syslog when the Bluetooth link hangs and I want to kill pppd
> with "poff":
>
> Jan 16 23:55:59 twister kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1
> Jan 16 23:56:09 twister kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1
>
> But a "killall -9 pppd" seems to help and then the re-connect (after the
> phone got power-cycled) works.

Weird, I guess "device_move(dev, NULL) two times" cause the problem.

Anyway, device_move should check the old_parent and new_parent , if
they equal to each other then just return.

Am I right?

>
>
> Gabor
>
> --
> ---------------------------------------------------------
> MTA SZTAKI Computer and Automation Research Institute
> Hungarian Academy of Sciences
> ---------------------------------------------------------
>

2008-01-17 08:12:19

by Dave Young

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Thu, Jan 17, 2008 at 03:24:50PM +0800, Dave Young wrote:
> On Jan 17, 2008 7:06 AM, Gabor Gombas <[email protected]> wrote:
> > Hi,
> >
> > On Wed, Jan 16, 2008 at 09:02:05AM +0800, Dave Young wrote:
> >
> > > The rfcomm tty device will possibly retain even when conn is down,
> > > and sysfs doesn't support zombie device moving, so this patch
> > > move the tty device before conn device is destroyed.
> > >
> > > Signed-off-by: Dave Young <[email protected]>
> >
> > This seems to work, both the oops and the hang are gone. I get these
> > messages in syslog when the Bluetooth link hangs and I want to kill pppd
> > with "poff":
> >
> > Jan 16 23:55:59 twister kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1
> > Jan 16 23:56:09 twister kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1
> >
> > But a "killall -9 pppd" seems to help and then the re-connect (after the
> > phone got power-cycled) works.
>
> Weird, I guess "device_move(dev, NULL) two times" cause the problem.
>
> Anyway, device_move should check the old_parent and new_parent , if
> they equal to each other then just return.
>
> Am I right?

Could you apply this patch as well to test? Thanks.

diff -upr linux/net/bluetooth/rfcomm/tty.c linux.new/net/bluetooth/rfcomm/tty.c
--- linux/net/bluetooth/rfcomm/tty.c 2008-01-17 16:09:34.000000000 +0800
+++ linux.new/net/bluetooth/rfcomm/tty.c 2008-01-17 16:10:22.000000000 +0800
@@ -692,7 +692,8 @@ static void rfcomm_tty_close(struct tty_
BT_DBG("tty %p dev %p dlc %p opened %d", tty, dev, dev->dlc, dev->opened);

if (--dev->opened == 0) {
- device_move(dev->tty_dev, NULL);
+ if (dev->tty_dev->parent)
+ device_move(dev->tty_dev, NULL);

/* Close DLC and dettach TTY */
rfcomm_dlc_close(dev->dlc, 0);

>
> >
> >
> > Gabor
> >
> > --
> > ---------------------------------------------------------
> > MTA SZTAKI Computer and Automation Research Institute
> > Hungarian Academy of Sciences
> > ---------------------------------------------------------
> >

2008-01-17 11:43:33

by Cornelia Huck

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Thu, 17 Jan 2008 16:15:04 +0800,
Dave Young <[email protected]> wrote:

> On Thu, Jan 17, 2008 at 03:24:50PM +0800, Dave Young wrote:
> > On Jan 17, 2008 7:06 AM, Gabor Gombas <[email protected]> wrote:
> > > Hi,
> > >
> > > On Wed, Jan 16, 2008 at 09:02:05AM +0800, Dave Young wrote:
> > >
> > > > The rfcomm tty device will possibly retain even when conn is down,
> > > > and sysfs doesn't support zombie device moving, so this patch
> > > > move the tty device before conn device is destroyed.
> > > >
> > > > Signed-off-by: Dave Young <[email protected]>
> > >
> > > This seems to work, both the oops and the hang are gone. I get these
> > > messages in syslog when the Bluetooth link hangs and I want to kill pppd
> > > with "poff":
> > >
> > > Jan 16 23:55:59 twister kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1
> > > Jan 16 23:56:09 twister kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1

Might be helpful to try with CONFIG_DEBUG_DRIVER=y and
CONFIG_KOBJECT_DEBUG=y to spot where a reference is not dropped.

> > >
> > > But a "killall -9 pppd" seems to help and then the re-connect (after the
> > > phone got power-cycled) works.
> >
> > Weird, I guess "device_move(dev, NULL) two times" cause the problem.
> >
> > Anyway, device_move should check the old_parent and new_parent , if
> > they equal to each other then just return.

sysfs_move_dir() does this (to avoid a loop later in the function).
Don't know if that's a good thing to check at a higher level as well,
because the calling code should really know where their devices are.

Anyway, if this "move in place" exposes a refcounting bug, we must fix
it.

> >
> > Am I right?
>
> Could you apply this patch as well to test? Thanks.
>
> diff -upr linux/net/bluetooth/rfcomm/tty.c linux.new/net/bluetooth/rfcomm/tty.c
> --- linux/net/bluetooth/rfcomm/tty.c 2008-01-17 16:09:34.000000000 +0800
> +++ linux.new/net/bluetooth/rfcomm/tty.c 2008-01-17 16:10:22.000000000 +0800
> @@ -692,7 +692,8 @@ static void rfcomm_tty_close(struct tty_
> BT_DBG("tty %p dev %p dlc %p opened %d", tty, dev, dev->dlc, dev->opened);
>
> if (--dev->opened == 0) {
> - device_move(dev->tty_dev, NULL);
> + if (dev->tty_dev->parent)
> + device_move(dev->tty_dev, NULL);
>
> /* Close DLC and dettach TTY */
> rfcomm_dlc_close(dev->dlc, 0);
>

Avoiding to move devices when there is nothing to be moved nevertheless
sounds like a good idea :)

2008-01-18 03:37:34

by Dave Young

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Jan 17, 2008 7:42 PM, Cornelia Huck <[email protected]> wrote:
> On Thu, 17 Jan 2008 16:15:04 +0800,
> Dave Young <[email protected]> wrote:
>
> > On Thu, Jan 17, 2008 at 03:24:50PM +0800, Dave Young wrote:
> > > On Jan 17, 2008 7:06 AM, Gabor Gombas <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > On Wed, Jan 16, 2008 at 09:02:05AM +0800, Dave Young wrote:
> > > >
> > > > > The rfcomm tty device will possibly retain even when conn is down,
> > > > > and sysfs doesn't support zombie device moving, so this patch
> > > > > move the tty device before conn device is destroyed.
> > > > >
> > > > > Signed-off-by: Dave Young <[email protected]>
> > > >
> > > > This seems to work, both the oops and the hang are gone. I get these
> > > > messages in syslog when the Bluetooth link hangs and I want to kill pppd
> > > > with "poff":
> > > >
> > > > Jan 16 23:55:59 twister kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1
> > > > Jan 16 23:56:09 twister kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1
>
> Might be helpful to try with CONFIG_DEBUG_DRIVER=y and
> CONFIG_KOBJECT_DEBUG=y to spot where a reference is not dropped.
>
> > > >
> > > > But a "killall -9 pppd" seems to help and then the re-connect (after the
> > > > phone got power-cycled) works.
> > >
> > > Weird, I guess "device_move(dev, NULL) two times" cause the problem.
> > >
> > > Anyway, device_move should check the old_parent and new_parent , if
> > > they equal to each other then just return.
>
> sysfs_move_dir() does this (to avoid a loop later in the function).
> Don't know if that's a good thing to check at a higher level as well,
> because the calling code should really know where their devices are.
>
> Anyway, if this "move in place" exposes a refcounting bug, we must fix
> it.

Lets see the device_move function, seems there's some problems in it:

1302 int device_move(struct device *dev, struct device *new_parent)
1303 {
1304 int error;
1305 struct device *old_parent;
1306 struct kobject *new_parent_kobj;
1307
1308 dev = get_device(dev);
1309 if (!dev)
1310 return -EINVAL;
1311
1312 new_parent = get_device(new_parent);
1313 new_parent_kobj = get_device_parent (dev, new_parent);

Here could get kobject reference

1314 if (IS_ERR(new_parent_kobj)) {
1315 error = PTR_ERR(new_parent_kobj);
1316 put_device(new_parent);
1317 goto out;
1318 }
1319 pr_debug("DEVICE: moving '%s' to '%s'\n", dev->bus_id,
1320 new_parent ? new_parent->bus_id : "<NULL>");
1321 error = kobject_move(&dev->kobj, new_parent_kobj);
1322 if (error) {
1323 put_device(new_parent);

imagine new_parent is NULL, then the new_parent_kobj should be put

1324 goto out;
1325 }
1326 old_parent = dev->parent;
1327 dev->parent = new_parent;
1328 if (old_parent)
1329 klist_remove(&dev->knode_parent);
1330 if (new_parent)
1331 klist_add_tail(&dev->knode_parent,
&new_parent->klist_children);
1332 if (!dev->class)
1333 goto out_put;

Why not put new_parent | new_parent_kobj?

1334 error = device_move_class_links(dev, old_parent, new_parent);
1335 if (error) {
1336 /* We ignore errors on cleanup since we're hosed
anyway... */
1337 device_move_class_links(dev, new_parent, old_parent);
1338 if (!kobject_move(&dev->kobj, &old_parent->kobj)) {
1339 if (new_parent)
1340 klist_remove(&dev->knode_parent);
1341 if (old_parent)
1342 klist_add_tail(&dev->knode_parent,
1343
&old_parent->klist_children);
1344 }
1345 put_device(new_parent);

Same doubt as above

1346 goto out;
1347 }
1348 out_put:
1349 put_device(old_parent);
1350 out:
1351 put_device(dev);
1352 return error;
1353 }

Hope I'm wrong, but if it's indeed bugs, I will send a patch about this.

>
> > >
> > > Am I right?
> >
> > Could you apply this patch as well to test? Thanks.
> >
> > diff -upr linux/net/bluetooth/rfcomm/tty.c linux.new/net/bluetooth/rfcomm/tty.c
> > --- linux/net/bluetooth/rfcomm/tty.c 2008-01-17 16:09:34.000000000 +0800
> > +++ linux.new/net/bluetooth/rfcomm/tty.c 2008-01-17 16:10:22.000000000 +0800
> > @@ -692,7 +692,8 @@ static void rfcomm_tty_close(struct tty_
> > BT_DBG("tty %p dev %p dlc %p opened %d", tty, dev, dev->dlc, dev->opened);
> >
> > if (--dev->opened == 0) {
> > - device_move(dev->tty_dev, NULL);
> > + if (dev->tty_dev->parent)
> > + device_move(dev->tty_dev, NULL);
> >
> > /* Close DLC and dettach TTY */
> > rfcomm_dlc_close(dev->dlc, 0);
> >
>
> Avoiding to move devices when there is nothing to be moved nevertheless
> sounds like a good idea :)
>

2008-01-18 09:20:18

by Cornelia Huck

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Fri, 18 Jan 2008 11:37:21 +0800,
"Dave Young" <[email protected]> wrote:


> Lets see the device_move function, seems there's some problems in it:
>
> 1302 int device_move(struct device *dev, struct device *new_parent)
> 1303 {
> 1304 int error;
> 1305 struct device *old_parent;
> 1306 struct kobject *new_parent_kobj;
> 1307
> 1308 dev = get_device(dev);
> 1309 if (!dev)
> 1310 return -EINVAL;
> 1311
> 1312 new_parent = get_device(new_parent);
> 1313 new_parent_kobj = get_device_parent (dev, new_parent);
>
> Here could get kobject reference

Eww. get_device_parent() may inflate the refcount in one case
for !CONFIG_SYSFS_DEPRECATED, but often won't. (And the function is
named confusingly, since it hints that we always get a reference, which
we don't.)

>
> 1314 if (IS_ERR(new_parent_kobj)) {
> 1315 error = PTR_ERR(new_parent_kobj);
> 1316 put_device(new_parent);
> 1317 goto out;
> 1318 }
> 1319 pr_debug("DEVICE: moving '%s' to '%s'\n", dev->bus_id,
> 1320 new_parent ? new_parent->bus_id : "<NULL>");
> 1321 error = kobject_move(&dev->kobj, new_parent_kobj);
> 1322 if (error) {
> 1323 put_device(new_parent);
>
> imagine new_parent is NULL, then the new_parent_kobj should be put

No, we would need a put_device_parent() (crappy name) which puts the
reference iff get_device_parent() grabbed it.

>
> 1324 goto out;
> 1325 }
> 1326 old_parent = dev->parent;
> 1327 dev->parent = new_parent;
> 1328 if (old_parent)
> 1329 klist_remove(&dev->knode_parent);
> 1330 if (new_parent)
> 1331 klist_add_tail(&dev->knode_parent,
> &new_parent->klist_children);
> 1332 if (!dev->class)
> 1333 goto out_put;
>
> Why not put new_parent | new_parent_kobj?

Because that is the good case :)

>
> 1334 error = device_move_class_links(dev, old_parent, new_parent);
> 1335 if (error) {
> 1336 /* We ignore errors on cleanup since we're hosed
> anyway... */
> 1337 device_move_class_links(dev, new_parent, old_parent);
> 1338 if (!kobject_move(&dev->kobj, &old_parent->kobj)) {
> 1339 if (new_parent)
> 1340 klist_remove(&dev->knode_parent);
> 1341 if (old_parent)
> 1342 klist_add_tail(&dev->knode_parent,
> 1343
> &old_parent->klist_children);
> 1344 }
> 1345 put_device(new_parent);
>
> Same doubt as above

We'd need put_device_parent() or whatever here as well, I guess.

>
> 1346 goto out;
> 1347 }
> 1348 out_put:
> 1349 put_device(old_parent);
> 1350 out:
> 1351 put_device(dev);
> 1352 return error;
> 1353 }
>
> Hope I'm wrong, but if it's indeed bugs, I will send a patch about this.

There are more problems, I'm afraid :( setup_parent() calls
get_device_parent() as well, so device_add() has the same problems on
error cleanup...

I'll take a look at it if I find some time, but I'm afraid I'll not
be able to do so before next week.

2008-01-18 10:23:51

by Cornelia Huck

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Fri, 18 Jan 2008 10:19:33 +0100,
Cornelia Huck <[email protected]> wrote:

> >
> > 1314 if (IS_ERR(new_parent_kobj)) {
> > 1315 error = PTR_ERR(new_parent_kobj);
> > 1316 put_device(new_parent);
> > 1317 goto out;
> > 1318 }
> > 1319 pr_debug("DEVICE: moving '%s' to '%s'\n", dev->bus_id,
> > 1320 new_parent ? new_parent->bus_id : "<NULL>");
> > 1321 error = kobject_move(&dev->kobj, new_parent_kobj);
> > 1322 if (error) {
> > 1323 put_device(new_parent);
> >
> > imagine new_parent is NULL, then the new_parent_kobj should be put
>
> No, we would need a put_device_parent() (crappy name) which puts the
> reference iff get_device_parent() grabbed it.

And looking at Greg's patchset, it has cleanup_device_parent(), which
does just that. But it is only called in device_del(), not when
device_move() has errors.

(get_device_parent() also always returns a pointer to a kobject or
NULL, so we can get rid of those IS_ERR() checks in setup_parent() and
device_move() as well.)

2008-01-18 10:35:22

by Dave Young

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Jan 18, 2008 6:23 PM, Cornelia Huck <[email protected]> wrote:
> On Fri, 18 Jan 2008 10:19:33 +0100,
> Cornelia Huck <[email protected]> wrote:
>
> > >
> > > 1314 if (IS_ERR(new_parent_kobj)) {
> > > 1315 error = PTR_ERR(new_parent_kobj);
> > > 1316 put_device(new_parent);
> > > 1317 goto out;
> > > 1318 }
> > > 1319 pr_debug("DEVICE: moving '%s' to '%s'\n", dev->bus_id,
> > > 1320 new_parent ? new_parent->bus_id : "<NULL>");
> > > 1321 error = kobject_move(&dev->kobj, new_parent_kobj);
> > > 1322 if (error) {
> > > 1323 put_device(new_parent);
> > >
> > > imagine new_parent is NULL, then the new_parent_kobj should be put
> >
> > No, we would need a put_device_parent() (crappy name) which puts the
> > reference iff get_device_parent() grabbed it.
>
> And looking at Greg's patchset, it has cleanup_device_parent(), which
> does just that. But it is only called in device_del(), not when
> device_move() has errors.
>
> (get_device_parent() also always returns a pointer to a kobject or
> NULL, so we can get rid of those IS_ERR() checks in setup_parent() and
> device_move() as well.)
>

Hmm, thanks.
I will be offline during weekend, but I will still check the
device_move and other code if I have time.

2008-01-18 11:27:47

by Cornelia Huck

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Fri, 18 Jan 2008 18:34:55 +0800,
"Dave Young" <[email protected]> wrote:

> On Jan 18, 2008 6:23 PM, Cornelia Huck <[email protected]> wrote:
> > On Fri, 18 Jan 2008 10:19:33 +0100,
> > Cornelia Huck <[email protected]> wrote:
> >
> > > >
> > > > 1314 if (IS_ERR(new_parent_kobj)) {
> > > > 1315 error = PTR_ERR(new_parent_kobj);
> > > > 1316 put_device(new_parent);
> > > > 1317 goto out;
> > > > 1318 }
> > > > 1319 pr_debug("DEVICE: moving '%s' to '%s'\n", dev->bus_id,
> > > > 1320 new_parent ? new_parent->bus_id : "<NULL>");
> > > > 1321 error = kobject_move(&dev->kobj, new_parent_kobj);
> > > > 1322 if (error) {
> > > > 1323 put_device(new_parent);
> > > >
> > > > imagine new_parent is NULL, then the new_parent_kobj should be put
> > >
> > > No, we would need a put_device_parent() (crappy name) which puts the
> > > reference iff get_device_parent() grabbed it.
> >
> > And looking at Greg's patchset, it has cleanup_device_parent(), which
> > does just that. But it is only called in device_del(), not when
> > device_move() has errors.
> >
> > (get_device_parent() also always returns a pointer to a kobject or
> > NULL, so we can get rid of those IS_ERR() checks in setup_parent() and
> > device_move() as well.)
> >
>
> Hmm, thanks.
> I will be offline during weekend, but I will still check the
> device_move and other code if I have time.

Just hacked together the following against Greg's tree:

COMPLETELY UNTESTED DO NOT APPLY

---
drivers/base/core.c | 33 ++++++++++++++++-----------------
1 files changed, 16 insertions(+), 17 deletions(-)

--- linux-2.6.orig/drivers/base/core.c
+++ linux-2.6/drivers/base/core.c
@@ -553,6 +553,8 @@ static struct kobject *get_device_parent
}

static inline void cleanup_device_parent(struct device *dev) {}
+static inline void cleanup_glue_dir(struct device *dev,
+ struct kobject *kobj) {}
#else
static struct kobject *virtual_device_parent(struct device *dev)
{
@@ -617,27 +619,27 @@ static struct kobject *get_device_parent
return NULL;
}

-static void cleanup_device_parent(struct device *dev)
+static void cleanup_glue_dir(struct device *dev, struct kobject *kobj)
{
- struct kobject *glue_dir = dev->kobj.parent;
-
/* see if we live in a "glue" directory */
- if (!dev->class || glue_dir->kset != &dev->class->class_dirs)
+ if (!dev->class || kobj->kset != &dev->class->class_dirs)
return;

- kobject_put(glue_dir);
+ kobject_put(kobj);
+}
+
+static void cleanup_device_parent(struct device *dev)
+{
+ cleanup_glue_dir(dev, dev->kobj.parent);
}
#endif

-static int setup_parent(struct device *dev, struct device *parent)
+static void setup_parent(struct device *dev, struct device *parent)
{
struct kobject *kobj;
kobj = get_device_parent(dev, parent);
- if (IS_ERR(kobj))
- return PTR_ERR(kobj);
if (kobj)
dev->kobj.parent = kobj;
- return 0;
}

static int device_add_class_symlinks(struct device *dev)
@@ -782,9 +784,7 @@ int device_add(struct device *dev)
pr_debug("device: '%s': %s\n", dev->bus_id, __FUNCTION__);

parent = get_device(dev->parent);
- error = setup_parent(dev, parent);
- if (error)
- goto Error;
+ setup_parent(dev, parent);

/* first, register with generic layer. */
error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev->bus_id);
@@ -862,6 +862,7 @@ int device_add(struct device *dev)
kobject_uevent(&dev->kobj, KOBJ_REMOVE);
kobject_del(&dev->kobj);
Error:
+ cleanup_device_parent(dev);
if (parent)
put_device(parent);
goto Done;
@@ -1303,15 +1304,12 @@ int device_move(struct device *dev, stru

new_parent = get_device(new_parent);
new_parent_kobj = get_device_parent (dev, new_parent);
- if (IS_ERR(new_parent_kobj)) {
- error = PTR_ERR(new_parent_kobj);
- put_device(new_parent);
- goto out;
- }
+
pr_debug("device: '%s': %s: moving to '%s'\n", dev->bus_id,
__FUNCTION__, new_parent ? new_parent->bus_id : "<NULL>");
error = kobject_move(&dev->kobj, new_parent_kobj);
if (error) {
+ cleanup_glue_dir(dev, new_parent_kobj);
put_device(new_parent);
goto out;
}
@@ -1334,6 +1332,7 @@ int device_move(struct device *dev, stru
klist_add_tail(&dev->knode_parent,
&old_parent->klist_children);
}
+ cleanup_glue_dir(dev, new_parent_kobj);
put_device(new_parent);
goto out;
}

2008-01-21 03:16:01

by Dave Young

[permalink] [raw]
Subject: Re: [Bluez-devel] Oops involving RFCOMM and sysfs

On Jan 18, 2008 7:26 PM, Cornelia Huck <[email protected]> wrote:
> On Fri, 18 Jan 2008 18:34:55 +0800,
> "Dave Young" <[email protected]> wrote:
>
>
> > On Jan 18, 2008 6:23 PM, Cornelia Huck <[email protected]> wrote:
> > > On Fri, 18 Jan 2008 10:19:33 +0100,
> > > Cornelia Huck <[email protected]> wrote:
> > >
> > > > >
> > > > > 1314 if (IS_ERR(new_parent_kobj)) {
> > > > > 1315 error = PTR_ERR(new_parent_kobj);
> > > > > 1316 put_device(new_parent);
> > > > > 1317 goto out;
> > > > > 1318 }
> > > > > 1319 pr_debug("DEVICE: moving '%s' to '%s'\n", dev->bus_id,
> > > > > 1320 new_parent ? new_parent->bus_id : "<NULL>");
> > > > > 1321 error = kobject_move(&dev->kobj, new_parent_kobj);
> > > > > 1322 if (error) {
> > > > > 1323 put_device(new_parent);
> > > > >
> > > > > imagine new_parent is NULL, then the new_parent_kobj should be put
> > > >
> > > > No, we would need a put_device_parent() (crappy name) which puts the
> > > > reference iff get_device_parent() grabbed it.
> > >
> > > And looking at Greg's patchset, it has cleanup_device_parent(), which
> > > does just that. But it is only called in device_del(), not when
> > > device_move() has errors.
> > >
> > > (get_device_parent() also always returns a pointer to a kobject or
> > > NULL, so we can get rid of those IS_ERR() checks in setup_parent() and
> > > device_move() as well.)
> > >
> >
> > Hmm, thanks.
> > I will be offline during weekend, but I will still check the
> > device_move and other code if I have time.
>
> Just hacked together the following against Greg's tree:
>
> COMPLETELY UNTESTED DO NOT APPLY
>
> ---
> drivers/base/core.c | 33 ++++++++++++++++-----------------
> 1 files changed, 16 insertions(+), 17 deletions(-)
>
> --- linux-2.6.orig/drivers/base/core.c
> +++ linux-2.6/drivers/base/core.c
> @@ -553,6 +553,8 @@ static struct kobject *get_device_parent
> }
>
> static inline void cleanup_device_parent(struct device *dev) {}
> +static inline void cleanup_glue_dir(struct device *dev,
> + struct kobject *kobj) {}
> #else
> static struct kobject *virtual_device_parent(struct device *dev)
> {
> @@ -617,27 +619,27 @@ static struct kobject *get_device_parent
> return NULL;
> }
>
> -static void cleanup_device_parent(struct device *dev)
> +static void cleanup_glue_dir(struct device *dev, struct kobject *kobj)
> {
> - struct kobject *glue_dir = dev->kobj.parent;
> -
> /* see if we live in a "glue" directory */
> - if (!dev->class || glue_dir->kset != &dev->class->class_dirs)
> + if (!dev->class || kobj->kset != &dev->class->class_dirs)
> return;
>
> - kobject_put(glue_dir);
> + kobject_put(kobj);
> +}
> +
> +static void cleanup_device_parent(struct device *dev)
> +{
> + cleanup_glue_dir(dev, dev->kobj.parent);
> }
> #endif
>
> -static int setup_parent(struct device *dev, struct device *parent)
> +static void setup_parent(struct device *dev, struct device *parent)
> {
> struct kobject *kobj;
> kobj = get_device_parent(dev, parent);
> - if (IS_ERR(kobj))
> - return PTR_ERR(kobj);
> if (kobj)
> dev->kobj.parent = kobj;
> - return 0;
> }
>
> static int device_add_class_symlinks(struct device *dev)
> @@ -782,9 +784,7 @@ int device_add(struct device *dev)
> pr_debug("device: '%s': %s\n", dev->bus_id, __FUNCTION__);
>
> parent = get_device(dev->parent);
> - error = setup_parent(dev, parent);
> - if (error)
> - goto Error;
> + setup_parent(dev, parent);
>
> /* first, register with generic layer. */
> error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev->bus_id);
> @@ -862,6 +862,7 @@ int device_add(struct device *dev)
> kobject_uevent(&dev->kobj, KOBJ_REMOVE);
> kobject_del(&dev->kobj);
> Error:
> + cleanup_device_parent(dev);
> if (parent)
> put_device(parent);
> goto Done;
> @@ -1303,15 +1304,12 @@ int device_move(struct device *dev, stru
>
> new_parent = get_device(new_parent);
> new_parent_kobj = get_device_parent (dev, new_parent);
> - if (IS_ERR(new_parent_kobj)) {
> - error = PTR_ERR(new_parent_kobj);
> - put_device(new_parent);
> - goto out;
> - }
> +
> pr_debug("device: '%s': %s: moving to '%s'\n", dev->bus_id,
> __FUNCTION__, new_parent ? new_parent->bus_id : "<NULL>");
> error = kobject_move(&dev->kobj, new_parent_kobj);
> if (error) {
> + cleanup_glue_dir(dev, new_parent_kobj);
> put_device(new_parent);
> goto out;
> }
> @@ -1334,6 +1332,7 @@ int device_move(struct device *dev, stru
> klist_add_tail(&dev->knode_parent,
> &old_parent->klist_children);
> }
> + cleanup_glue_dir(dev, new_parent_kobj);
> put_device(new_parent);
> goto out;
> }
>

Looks good.

2008-01-21 15:09:59

by Cornelia Huck

[permalink] [raw]
Subject: [Patch] Driver core: Cleanup get_device_parent() in device_add() and device_move().

Make setup_parent() void as get_device_parent() will always return
either a valid kobject or NULL.
Introduce cleanup_glue_dir() to drop reference grabbed on "glue"
directory by get_device_parent(). Use it for cleanup in device_move()
and device_add() on errors.

This should fix the refcounting problem reported in
http://marc.info/?l=linux-kernel&m=120052487909200&w=2

Signed-off-by: Cornelia Huck <[email protected]>

---
drivers/base/core.c | 29 ++++++++++++++---------------
1 files changed, 14 insertions(+), 15 deletions(-)

--- linux-2.6.orig/drivers/base/core.c
+++ linux-2.6/drivers/base/core.c
@@ -553,6 +553,8 @@ static struct kobject *get_device_parent
}

static inline void cleanup_device_parent(struct device *dev) {}
+static inline void cleanup_glue_dir(struct device *dev,
+ struct kobject *glue_dir) {}
#else
static struct kobject *virtual_device_parent(struct device *dev)
{
@@ -617,27 +619,27 @@ static struct kobject *get_device_parent
return NULL;
}

-static void cleanup_device_parent(struct device *dev)
+static void cleanup_glue_dir(struct device *dev, struct kobject *glue_dir)
{
- struct kobject *glue_dir = dev->kobj.parent;
-
/* see if we live in a "glue" directory */
if (!dev->class || glue_dir->kset != &dev->class->class_dirs)
return;

kobject_put(glue_dir);
}
+
+static void cleanup_device_parent(struct device *dev)
+{
+ cleanup_glue_dir(dev, dev->kobj.parent);
+}
#endif

-static int setup_parent(struct device *dev, struct device *parent)
+static void setup_parent(struct device *dev, struct device *parent)
{
struct kobject *kobj;
kobj = get_device_parent(dev, parent);
- if (IS_ERR(kobj))
- return PTR_ERR(kobj);
if (kobj)
dev->kobj.parent = kobj;
- return 0;
}

static int device_add_class_symlinks(struct device *dev)
@@ -785,9 +787,7 @@ int device_add(struct device *dev)
pr_debug("device: '%s': %s\n", dev->bus_id, __FUNCTION__);

parent = get_device(dev->parent);
- error = setup_parent(dev, parent);
- if (error)
- goto Error;
+ setup_parent(dev, parent);

/* first, register with generic layer. */
error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev->bus_id);
@@ -865,6 +865,7 @@ int device_add(struct device *dev)
kobject_uevent(&dev->kobj, KOBJ_REMOVE);
kobject_del(&dev->kobj);
Error:
+ cleanup_device_parent(dev);
if (parent)
put_device(parent);
goto Done;
@@ -1345,15 +1346,12 @@ int device_move(struct device *dev, stru

new_parent = get_device(new_parent);
new_parent_kobj = get_device_parent (dev, new_parent);
- if (IS_ERR(new_parent_kobj)) {
- error = PTR_ERR(new_parent_kobj);
- put_device(new_parent);
- goto out;
- }
+
pr_debug("device: '%s': %s: moving to '%s'\n", dev->bus_id,
__FUNCTION__, new_parent ? new_parent->bus_id : "<NULL>");
error = kobject_move(&dev->kobj, new_parent_kobj);
if (error) {
+ cleanup_glue_dir(dev, new_parent_kobj);
put_device(new_parent);
goto out;
}
@@ -1376,6 +1374,7 @@ int device_move(struct device *dev, stru
klist_add_tail(&dev->knode_parent,
&old_parent->klist_children);
}
+ cleanup_glue_dir(dev, new_parent_kobj);
put_device(new_parent);
goto out;
}