Return-Path: Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: [PATCH V1 0/3] hci_ldisc: inhibit "proto" function pointers, avoid crash From: Marcel Holtmann In-Reply-To: <1493384246-15381-1-git-send-email-Dean_Jenkins@mentor.com> Date: Fri, 28 Apr 2017 18:09:09 +0200 Cc: "Gustavo F. Padovan" , Johan Hedberg , linux-bluetooth@vger.kernel.org Message-Id: <6F8CAAD2-3257-4059-9E58-48BFA7FAE234@holtmann.org> References: <1493384246-15381-1-git-send-email-Dean_Jenkins@mentor.com> To: Dean Jenkins Sender: linux-bluetooth-owner@vger.kernel.org List-ID: Hi Dean, > This is my 2nd patchset in my series of resolving a design flaw in > hci_uart_tty_close() related to the proper closure of the Data Link protocol > layer which can result in a kernel crash. > > The 3 patches in this patchset will unmask the design flaw in > hci_uart_tty_close() and provide a partial fix. My future patchsets based on > my original 1st patchset V2 can be discussed at a later date to get a complete > fix. > > > Previous Discussions > ==================== > > Please refer to the discussion on my 1st patchset V3 that can be found here: > https://www.spinics.net/lists/linux-bluetooth/msg70315.html > You accepted the first 3 patches of my series of commits namely: > Bluetooth: hci_ldisc: Add missing return in hci_uart_init_work() > Bluetooth: hci_ldisc: Ensure hu->hdev set to NULL before freeing hdev > Bluetooth: hci_ldisc: Add missing clear HCI_UART_PROTO_READY > > drivers/bluetooth/hci_ldisc.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > Please refer to the discussion on my 1st patchset V2 that can be found here: > https://www.spinics.net/lists/linux-bluetooth/msg70183.html > > Please refer to the discussion on my 1st RFC patchset V1 that can be found here: > https://www.spinics.net/lists/linux-bluetooth/msg70077.html > > > Failure test case > ================= > > Note this failure is probably reproducible on many series 3 and 4 kernels. > > Connect a UART based Bluetooth Radio Module via a USB to serial adaptor to a > Linux PC running kernel v4.10.11. This module used the Data Link protocol called > BCSP but that is not too important because the issue is common to all the > Data Link protocols. If the Data Link protocol supports a retransmission > procedure then the probability of a kernel crash is higher. > > Use the Bluez utility called btattach to try to use BCSP. Note the BCSP > implementation is incomplete in the kernel so BCSP will not establish a > connection with the Bluetooth Radio Module. However, this is ideal in exposing > the design issue in hci_uart_tty_close() because BCSP performs retransmissions. > > I created this simple testcase script to trigger a kernel crash in > approximately 10 minutes of running. > > #!/bin/bash > > let i=1 > > while [ true ] > do > echo "loop $i" > ./btattach -B /dev/ttyUSB0 -P bcsp -r & > sleep 5 > echo "now killing..." > killall btattach > i=$(($i + 1)) > done > > I got the following kernel crash after 78 loops. > > Here is a snippet of the dmesg log (dynamic debug used) leading up to the crash: > > Note BCSP is retransmitting during hci_uart_tty_close() which increases the > probability of crashing. > > [ 2932.969451] hci_uart_tty_close: tty ffff8c015f11d400 > [ 2932.969455] hci_uart_close: hdev ffff8c01e3ee4000 > [ 2932.969458] hci_uart_flush: hdev ffff8c01e3ee4000 tty ffff8c015f11d400 > [ 2932.969460] bcsp_flush: hu ffff8c01f0bca240 > [ 2932.969465] hci_unregister_dev: ffff8c01e3ee4000 name hci1 bus 3 > [ 2933.154206] bcsp_timed_event: hu ffff8c01f0bca240 retransmitting 2 pkts > [ 2933.154231] hci_uart_tx_wakeup: > [ 2933.154446] bcsp_prepare_pkt: We request packet no 0 to card > [ 2933.154447] bcsp_prepare_pkt: Sending packet with seqno 0 > [ 2933.154465] bcsp_prepare_pkt: We request packet no 0 to card > [ 2933.154466] bcsp_prepare_pkt: Sending packet with seqno 1 > [ 2933.410201] bcsp_timed_event: hu ffff8c01f0bca240 retransmitting 2 pkts > [ 2933.410240] hci_uart_tx_wakeup: > [ 2933.410307] bcsp_prepare_pkt: We request packet no 0 to card > [ 2933.410308] bcsp_prepare_pkt: Sending packet with seqno 0 > [ 2933.410321] bcsp_prepare_pkt: We request packet no 0 to card > [ 2933.410322] bcsp_prepare_pkt: Sending packet with seqno 1 > [ 2933.666198] bcsp_timed_event: hu ffff8c01f0bca240 retransmitting 2 pkts > [ 2933.666225] hci_uart_tx_wakeup: > < snipped: do not show all BCSP retransmissions that occur every 250ms > > [ 2934.182902] Bluetooth: hci1 command 0x1001 tx timeout > [ 2934.182916] hci_cmd_work: hci1 cmd_cnt 1 cmd queued 1 > [ 2934.182920] hci_send_frame: hci1 type 1 len 3 > [ 2934.182922] hci_uart_send_frame: hci1: type 1 len 3 > [ 2934.182923] hci_uart_tx_wakeup: > [ 2934.182928] bcsp_prepare_pkt: We request packet no 0 to card > [ 2934.182929] bcsp_prepare_pkt: Sending packet with seqno 2 > [ 2934.434214] bcsp_timed_event: hu ffff8c01f0bca240 retransmitting 3 pkts > [ 2934.434217] hci_uart_tx_wakeup: > [ 2934.434243] bcsp_prepare_pkt: We request packet no 0 to card > [ 2934.434244] bcsp_prepare_pkt: Sending packet with seqno 0 > [ 2934.434263] bcsp_prepare_pkt: We request packet no 0 to card > [ 2934.434264] bcsp_prepare_pkt: Sending packet with seqno 1 > [ 2934.434268] bcsp_prepare_pkt: We request packet no 0 to card > [ 2934.434269] bcsp_prepare_pkt: Sending packet with seqno 2 > [ 2934.690178] bcsp_timed_event: hu ffff8c01f0bca240 retransmitting 3 pkts > [ 2934.690217] hci_uart_tx_wakeup: > < snipped: do not show all BCSP retransmissions that occur every 250ms > > [ 2936.226215] Bluetooth: hci1 command 0x1009 tx timeout > [ 2936.226233] hci_cmd_work: hci1 cmd_cnt 1 cmd queued 0 > [ 2936.226280] bcsp_prepare_pkt: We request packet no 0 to card > [ 2936.226282] bcsp_prepare_pkt: Sending packet with seqno 0 > [ 2936.226300] bcsp_prepare_pkt: We request packet no 0 to card > [ 2936.226301] bcsp_prepare_pkt: Sending packet with seqno 1 > [ 2936.226304] bcsp_prepare_pkt: We request packet no 0 to card > [ 2936.226306] bcsp_prepare_pkt: Sending packet with seqno 2 > [ 2936.482149] bcsp_timed_event: hu ffff8c01f0bca240 retransmitting 3 pkts > [ 2936.482179] hci_uart_tx_wakeup: > [ 2936.482259] bcsp_prepare_pkt: We request packet no 0 to card > [ 2936.482261] bcsp_prepare_pkt: Sending packet with seqno 0 > [ 2936.482280] bcsp_prepare_pkt: We request packet no 0 to card > [ 2936.482281] bcsp_prepare_pkt: Sending packet with seqno 1 > [ 2936.482285] bcsp_prepare_pkt: We request packet no 0 to card > [ 2936.482286] bcsp_prepare_pkt: Sending packet with seqno 2 > [ 2936.738141] bcsp_timed_event: hu ffff8c01f0bca240 retransmitting 3 pkts > [ 2936.738179] hci_uart_tx_wakeup: > < snipped: do not show all BCSP retransmissions that occur every 250ms > > [ 2940.066125] hci_uart_tx_wakeup: > [ 2940.066200] bcsp_prepare_pkt: We request packet no 0 to card > [ 2940.066202] bcsp_prepare_pkt: Sending packet with seqno 0 > [ 2940.066221] bcsp_prepare_pkt: We request packet no 0 to card > [ 2940.066223] bcsp_prepare_pkt: Sending packet with seqno 1 > [ 2940.066227] bcsp_prepare_pkt: We request packet no 0 to card > [ 2940.066230] bcsp_prepare_pkt: Sending packet with seqno 2 > [ 2940.322113] bcsp_timed_event: hu ffff8c01f0bca240 retransmitting 3 pkts > [ 2940.322126] hci_uart_tx_wakeup: > [ 2940.322147] hci_uart_close: hdev ffff8c01e3ee4000 > [ 2940.322149] hci_uart_flush: hdev ffff8c01e3ee4000 tty ffff8c015f11d400 > [ 2940.322164] hci_dev_do_close: hci1 ffff8c01e3ee4000 > [ 2940.322335] hci_conn_params_clear_all: All LE connection parameters were removed > [ 2940.322340] bcsp_close: hu ffff8c01f0bca240 > [ 2940.322667] ftdi_set_termios: ftdi_sio ttyUSB0: Setting CS8 > [ 2940.322686] BUG: unable to handle kernel NULL pointer dereference at 0000000000000044 > [ 2940.322750] IP: _raw_spin_lock_irqsave+0x22/0x40 > [ 2940.322779] PGD 0 > [ 2940.383274] Oops: 0002 [#1] SMP > [ 2940.384001] Modules linked in: hci_uart btqca xt_set ip_set_hash_ip ip_set nf_log_ipv4 nf_log_ipv6 nf_log_common xt_LOG ip6table_nat nf_nat_ipv6 xt_recent iptable_nat nf_nat_ipv4 ip6t_REJECT nf_reject_ipv6 xt_comment ip6table_mangle ipt_REJECT nf_reject_ipv4 xt_addrtype bridge stp llc xt_mark iptable_mangle xt_tcpudp iptable_raw nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_tftp xt_CT nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre ip6table_raw nf_nat_irc xt_multiport nf_nat_h323 nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ftp nf_nat_amanda nf_nat xt_conntrack nf_conntrack_sane nf_conntrack_tftp ts_kmp nf_conntrack_sip nf_conntrack_amanda nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nfnetlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 > [ 2940.388162] nf_conntrack_ftp nf_conntrack ip6table_filter ip6_tables iptable_filter ip_tables x_tables lockd grace ctr ccm af_packet bnep msr arc4 brcmsmac uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 cordic btusb btrtl ftdi_sio brcmutil btbcm videobuf2_core btintel mac80211 usbserial bluetooth intel_powerclamp videodev coretemp kvm_intel kvm media snd_hda_codec_hdmi snd_hda_codec_realtek cfg80211 snd_hda_codec_generic snd_hda_intel joydev snd_hda_codec iTCO_wdt irqbypass snd_hda_core toshiba_acpi sparse_keymap iTCO_vendor_support crc32c_intel bcma industrialio snd_hwdep snd_pcm psmouse toshiba_bluetooth snd_timer input_leds serio_raw mei_me rfkill ac e1000e snd thermal wmi shpchp intel_ips battery toshiba_haps soundcore fjes lpc_ich mei ptp pps_core tpm_tis tpm_tis_core tpm cpufreq_ondemand > [ 2940.393216] cpufreq_conservative cpufreq_powersave acpi_cpufreq evdev nvram sunrpc sch_fq_codel ipv6 crc_ccitt autofs4 hid_generic usbhid hid sdhci_pci mmc_block sr_mod sdhci ehci_pci ehci_hcd mmc_core usbcore usb_common i915 video button i2c_algo_bit drm_kms_helper drm > [ 2940.395473] CPU: 3 PID: 6669 Comm: kworker/3:1 Not tainted 4.10.11_pure #1 > [ 2940.396606] Hardware name: TOSHIBA Satellite R630/Portable PC, BIOS Version 1.40 07/20/2010 > [ 2940.397763] Workqueue: events hci_uart_write_work [hci_uart] > [ 2940.399076] task: ffff8c01f2883600 task.stack: ffffb2a704d34000 > [ 2940.400360] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40 > [ 2940.401511] RSP: 0018:ffffb2a704d37da8 EFLAGS: 00010046 > [ 2940.402660] RAX: 0000000000000000 RBX: 0000000000000296 RCX: ffff8c01f2883600 > [ 2940.403829] RDX: 0000000000000001 RSI: ffff8c01f7cd8ba0 RDI: 0000000000000044 > [ 2940.404995] RBP: ffffb2a704d37db0 R08: ffff8c01f7cd8b80 R09: 00000000000001f9 > [ 2940.406157] R10: 000002ac98e2f8d0 R11: 0000000000000001 R12: 0000000000000030 > [ 2940.407317] R13: 0000000000000044 R14: 0000000000000030 R15: ffff8c01f0bca780 > [ 2940.408495] FS: 0000000000000000(0000) GS:ffff8c01f7cc0000(0000) knlGS:0000000000000000 > [ 2940.409671] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 2940.410845] CR2: 0000000000000044 CR3: 00000001306fb000 CR4: 00000000000006e0 > [ 2940.412049] Call Trace: > [ 2940.413276] skb_dequeue+0x1d/0x70 > [ 2940.414511] bcsp_dequeue+0x27/0x180 [hci_uart] > [ 2940.415760] hci_uart_write_work+0xc4/0x100 [hci_uart] > [ 2940.417012] process_one_work+0x151/0x410 > [ 2940.418248] worker_thread+0x69/0x4b0 > [ 2940.419494] kthread+0x114/0x150 > [ 2940.420753] ? rescuer_thread+0x340/0x340 > [ 2940.422017] ? kthread_park+0x90/0x90 > [ 2940.423295] ret_from_fork+0x2c/0x40 > [ 2940.424563] Code: 89 e5 e8 22 55 98 ff 5d c3 66 66 66 66 90 55 48 89 e5 53 9c 58 66 66 90 66 90 48 89 c3 fa 66 66 90 66 66 90 31 c0 ba 01 00 00 00 0f b1 17 85 c0 75 06 48 89 d8 5b 5d c3 89 c6 e8 c9 41 98 ff > [ 2940.427448] RIP: _raw_spin_lock_irqsave+0x22/0x40 RSP: ffffb2a704d37da8 > [ 2940.428956] CR2: 0000000000000044 > [ 2940.435363] ---[ end trace 26e358778a5b55c5 ]--- > > > Analysis > ======== > > The kernel crashes because the work queue hu->write_work runs > hci_uart_write_work() which attempts to dequeue a Data Layer protocol frame > using hu->proto->dequeue(). Unfortunately, hci_uart_tty_close() has already > called hu->proto->close() which removes the resources needed by > hu->proto->dequeue() and a kernel crash occurs. > > Assuming device is hci0 (my actual test used hci1). > hciattach or btattach can be used and are examples. > > Example normal case scenario: > ----------------------------- > > 1. hciconfig hci0 up ==> hci_dev_do_open() and sets HCI_UP > > 2. hciconfig hci0 down ==> hci_dev_do_close() and clears HCI_UP > > 3. killall hciattach ==> hci_dev_do_close() exits early as HCI_UP is clear > > Notice that hci_dev_do_close() runs twice. > > There is probably many seconds between these events so the HCI layer is likely > to be idle before hci_uart_tty_close() runs. Therefore, in an environment where > the communications with the Bluetooth Radio Module are good, the kernel crash is > very unlikely to occur. > > Example failure case scenario: > ------------------------------ > > 1. hciconfig hci0 up ==> hci_dev_do_open() and sets HCI_UP > > 2. killall hciattach ==> hci_dev_do_close() and clears HCI_UP > > In the failure case, hci_dev_do_close() attempts to perform clean-up because > HCI_UP is in the set state. hci_dev_do_close() is running in the callstack: > > hci_uart_tty_close() > hci_unregister_dev() > hci_dev_do_close() > > hci_uart_tty_close() is acting against hci_dev_do_close() by systematically > inhibiting various aspects of the data plane between the Data Link protocol > layer and the UART driver. This seems to be a design flaw; either allow > successful transmissions OR fully inhibit transmissions. Unfortunately, the > design is attempting both at the same time. > > When hci_unregister_dev() returns to hci_uart_tty_close(), 1 or > more HCI commands might have been queued inside the Data Link protocol layer by > hu->proto->enqueue(). The Data Link protocol layer may attempt to periodically > retransmit these HCI commands. > > In any case, attempts at sending any HCI commands will fail during the > execution of hci_uart_tty_close() because the data plane between the Data Link > protocol Layer and the UART driver has been cut in various places including > inside the TTY layer. > > The weakness in hci_uart_tty_close() is that hu->proto->close() is not prevented > from running before or during the execution of the other "proto" function > pointers such as hu->proto->dequeue(). Therefore a race condition exists. > > > Solution implemented by this patchset > ===================================== > > When the flag HCI_UART_PROTO_READY is in the clear state, this means the "proto" > layer is no longer ready. Therefore use HCI_UART_PROTO_READY to prevent any > "proto" function pointer except hu->proto->close() from running. > > Already protected: > p->open() in hci_uart_set_proto() > hu->proto->flush() in hci_uart_flush() > hu->proto->close() in hci_uart_tty_close() > p->close() in hci_uart_set_proto() > hu->proto->close() in hci_uart_init_work() > > The following "proto" functions pointers are now protected by the patchset: > hu->proto->enqueue() in hci_uart_send_frame() > hu->proto->dequeue() in hci_uart_dequeue() > > by the implementations in the patches: > Bluetooth: hci_ldisc: Add protocol check to hci_uart_send_frame() > Bluetooth: hci_ldisc: Add protocol check to hci_uart_dequeue() > > This protection helps to avoid the kernel crash by ensuring that a "proto" > function pointer except hu->proto->close() does not run after > HCI_UART_PROTO_READY has been cleared in hci_uart_tty_close(). > > However, a complete solution (not implemented in this patchset) requires locking > to prevent hu->proto->close() from running whilst another "proto" function > pointer is already running. The flag HCI_UART_PROTO_READY is insufficient > because another "proto" function pointer may already be running at the point the > flag is cleared. > > In addition the patch: > Bluetooth: hci_ldisc: Add protocol check to hci_uart_tx_wakeup() > > prevents the scheduling of new or retransmitted HCI commands from being queued > on the work queue hu->write_work so that hci_uart_write_work() is not able to > run when HCI_UART_PROTO_READY is in the cleared state. This inhibits > transmission attempts so reduces the probability of hci_uart_write_work() > running after hu->proto->close() has run. > > > Test results with patchset applied > ================================== > > snippet of dmesg log > > [ 1122.904072] hci_uart_tty_close: tty ffff91b59c41c000 > [ 1122.904075] hci_uart_close: hdev ffff91b59c456000 > [ 1122.904078] hci_uart_flush: hdev ffff91b59c456000 tty ffff91b59c41c000 > [ 1122.904080] bcsp_flush: hu ffff91b5a29340c0 > [ 1122.904085] hci_unregister_dev: ffff91b59c456000 name hci1 bus 3 > [ 1122.984911] bcsp_timed_event: hu ffff91b5a29340c0 retransmitting 2 pkts > [ 1124.520958] Bluetooth: hci1 command 0x1001 tx timeout > [ 1124.521009] hci_cmd_work: hci1 cmd_cnt 1 cmd queued 1 > [ 1124.521013] hci_send_frame: hci1 type 1 len 3 > [ 1124.521015] hci_uart_send_frame: hci1: type 1 len 3 > [ 1124.521016] Bluetooth: hci1 sending frame failed (-49) > [ 1126.569048] Bluetooth: hci1 command 0x1009 tx timeout > [ 1126.569065] hci_cmd_work: hci1 cmd_cnt 1 cmd queued 0 > [ 1130.665151] hci_uart_close: hdev ffff91b59c456000 > [ 1130.665154] hci_uart_flush: hdev ffff91b59c456000 tty ffff91b59c41c000 > [ 1130.665173] hci_dev_do_close: hci1 ffff91b59c456000 > [ 1130.665357] hci_conn_params_clear_all: All LE connection parameters were removed > [ 1130.665361] bcsp_close: hu ffff91b5a29340c0 > > Notice the following error message is now output: > Bluetooth: hci1 sending frame failed (-49) > > This is because the patch > Bluetooth: hci_ldisc: Add protocol check to hci_uart_send_frame() > > is preventing hci_send_frame() from calling hu->proto->enqueue() after > hci_uart_tty_close() has cleared the flag HCI_UART_PROTO_READY. Therefore, > no new HCI commands can be queued within the Data Link protocol layer because > the Data Link protocol layer is pending to be closed. > > > Conclusions > =========== > > This patchset demonstrates the following design issues: > > 1. Attempts are made at trying to send HCI commands during hci_uart_tty_close() > but this is in conflict with the hci_uart_tty_close() closure procedures. > > 2. Data Link protocol layer retransmission attempts can occur during > hci_uart_tty_close() to resend HCI commands but this is in conflict with the > hci_uart_tty_close() closure procedures. > > 3. Sending of HCI commands fail anyway during hci_uart_tty_close() because > the data plane between the Data Link protocol layer and the UART driver has > been cut in various places including inside the TTY layer. Therefore the > attempts at sending HCI commands during hci_uart_tty_close() are futile. > > This means that, currently the design is attempting to send HCI commands AND > trying to inhibit transmissions at the same time. Therefore, there is a conflict > in the procedures. > > The design of hci_uart_tty_close() needs to be changed to either > a) allow sucessful transmission of HCI commands and implement a clean controlled > closure of the Data Link protocol. > OR > b) abruptly inhibit further communications of the Data Link protocol. > > > Side-effects > ------------ > > The patchset should be safe although a side-effect could be some occassional > error messages: > Bluetooth: hci1 sending frame failed (-49) > > Note hci_send_frame() does not propagate the -EUNATCH (-49) error code so takes > no specific action. Therefore, hci_core.c handling is not modified by the > patchset. > > > Dean Jenkins (3): > Bluetooth: hci_ldisc: Add protocol check to hci_uart_send_frame() > Bluetooth: hci_ldisc: Add protocol check to hci_uart_dequeue() > Bluetooth: hci_ldisc: Add protocol check to hci_uart_tx_wakeup() > > drivers/bluetooth/hci_ldisc.c | 14 +++++++++++--- > 1 file changed, 11 insertions(+), 3 deletions(-) all 3 patches have been applied to bluetooth-next tree. Regards Marcel