Date: Thu, 5 Aug 2021 20:39:58 +0200
From: Greg KH
To: Mathias Nyman
Cc: linux-usb@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: USB xhci crash under load on 5.14-rc3
In-Reply-To: <9bb1d58b-5c68-86b7-13df-2faa749880c5@linux.intel.com>

On Thu, Aug 05, 2021 at 05:59:00PM +0300, Mathias
Nyman wrote:
> On 4.8.2021 11.00, Greg KH wrote:
> > Hi,
> >
> > I was doing some filesystem backups from one USB device to another one
> > this weekend and kept running into the problem of the xhci controller
> > shutting down after an hour or so of high volume traffic.
> >
> > I finally captured the problem in the kernel log as this would also take
> > out my keyboard, making it hard to recover from :)
> >
> > The log is below for when the problem happens, and then the devices are
> > disconnected from the bus (ignore the filesystem errors, those are
> > expected when i/o is in flight and we disconnect a device).
> >
> > Any hint as to what the IO_PAGE_FAULT error messages are?
> >
>
> No idea, unfortunately.
>
> > I'll go back to 5.13.y now and see if I can reproduce it there or not,
> > as my backups are not yet done...
> >
> > thanks,
> >
> > greg k-h
> >
> >
> > [Aug 4 09:48] xhci_hcd 0000:47:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0xfffffff00000 flags=0x0000]
> > [ +0.000012] xhci_hcd 0000:47:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0xfffffff00f80 flags=0x0000]
> > [ +0.000006] xhci_hcd 0000:47:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0xfffffff01000 flags=0x0000]
> > [ +0.000006] xhci_hcd 0000:47:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0xfffffff01f80 flags=0x0000]
> > [ +0.000005] xhci_hcd 0000:47:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0xfffffff02000 flags=0x0000]
> > [ +0.000006] xhci_hcd 0000:47:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0xfffffff02f80 flags=0x0000]
> > [ +0.000006] xhci_hcd 0000:47:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0xfffffff03000 flags=0x0000]
> > [ +0.000005] xhci_hcd 0000:47:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0xfffffff03f80 flags=0x0000]
> > [ +0.000006] xhci_hcd 0000:47:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0032 address=0xfffffff04000 flags=0x0000]
> > [Aug 4 09:49] sd 3:0:0:0: [sdc] tag#21 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD IN
> > [ +0.000011] sd 3:0:0:0: [sdc] tag#21 CDB: Read(16) 88 00 00 00 00 01 8a 44 08 b0 00 00 00 08 00 00
> > [ +5.106493] xhci_hcd 0000:47:00.1: xHCI host not responding to stop endpoint command.
> > [ +0.000010] xhci_hcd 0000:47:00.1: USBSTS: HCHalted HSE
> >
> HSE "Host System Error" bit is set, meaning xHC hardware detected a serious error and stopped the host.
> HSE was probably set 5-10 seconds earlier, but only discovered here.
>
> Specs state:
>
> xHC sets this bit to ‘1’ when a serious error
> is detected, either internal to the xHC or during a host system access involving the xHC module.
> (In a PCI system, conditions that set this bit to ‘1’ include PCI Parity error, PCI Master Abort, and
> PCI Target Abort.)

Ok, I can believe a PCI error here: hammering an xhci controller with
read/write streams to two different storage devices on the same bus for
a few hours, as fast as the bus allows, is a good stress test.

I tried splitting this across PCI devices, and cannot seem to duplicate
the failure in the xhci controllers. Now the devices fail with disk
errors after about a terabyte of traffic, but are recoverable after
unplugging them, plugging them back in, and running fsck. Cheap USB
storage, gotta love it...

If I come up with a reproducible failure, I'll let you know, thanks for
the help,

greg k-h
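
For anyone reading the "USBSTS: HCHalted HSE" line in the log above, here is a minimal sketch of how such a line can be derived from a raw status register value. It assumes the USBSTS bit positions as I understand them from the xHCI specification (HCHalted at bit 0, HSE at bit 2, and so on); the table name, the function name decode_usbsts, and the sample value 0x5 are made up for illustration and are not taken from the kernel source or from this report.

```python
# Assumed USBSTS bit positions (per the xHCI spec); names mirror the log output.
USBSTS_BITS = {
    0: "HCHalted",  # host controller halted
    2: "HSE",       # host system error
    3: "EINT",      # event interrupt
    4: "PCD",       # port change detect
    8: "SSS",       # save state status
    9: "RSS",       # restore state status
    10: "SRE",      # save/restore error
    11: "CNR",      # controller not ready
    12: "HCE",      # host controller error
}


def decode_usbsts(value: int) -> list:
    """Return the names of the flags set in a raw USBSTS value, lowest bit first."""
    return [name for bit, name in sorted(USBSTS_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    # Hypothetical raw value with bits 0 and 2 set.
    print("USBSTS:", " ".join(decode_usbsts(0x5)))
```

With bits 0 and 2 set, the sketch prints the same two flags as the log, which matches Mathias's reading: the controller is halted and the host system error bit is set.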