Message-ID: <1349976523.7065.22.camel@rhapsody>
Subject: Re: [PATCH v3 1/2] x86, pci: Reset PCIe devices at boot time
From: Khalid Aziz <khalid@gonehiking.org>
To: Takao Indoh <indou.takao@jp.fujitsu.com>
Cc: martin.wilck@ts.fujitsu.com, linux-pci@vger.kernel.org, x86@kernel.org,
        kexec@lists.infradead.org, linux-kernel@vger.kernel.org,
        hbabu@us.ibm.com, andi@firstfloor.org, ddutile@redhat.com,
        ishii.hironobu@jp.fujitsu.com, hpa@zytor.com, bhelgaas@google.com,
        tglx@linutronix.de, mingo@redhat.com, vgoyal@redhat.com
Date: Thu, 11 Oct 2012 11:28:43 -0600
In-Reply-To: <50766432.9000706@jp.fujitsu.com>
References: <20121010074553.1084.52175.sendpatchset@indoh>
	 <20121010074603.1084.92389.sendpatchset@indoh>
	 <1349899704.25679.14.camel@rhapsody> <50766432.9000706@jp.fujitsu.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4423
Lines: 120

On Thu, 2012-10-11 at 15:16 +0900, Takao Indoh wrote:
> (2012/10/11 5:08), Khalid Aziz wrote:
.....
> >> +static void __init do_reset(u8 bus, u8 slot, u8 func)
> >> +{
> >> +	u16 ctrl;
> >> +
> >> +	printk(KERN_INFO "pci 0000:%02x:%02x.%d reset\n", bus, slot, func);
> >> +
> >> +	/* Assert Secondary Bus Reset */
> >> +	ctrl = read_pci_config_16(bus, slot, func, PCI_BRIDGE_CONTROL);
> >> +	ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
> >> +	write_pci_config_16(bus, slot, func, PCI_BRIDGE_CONTROL, ctrl);
> >> +
> >> +	pci_udelay(5000);
> >> +
> >> +	/* De-assert Secondary Bus Reset */
> >> +	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> >> +	write_pci_config_16(bus, slot, func, PCI_BRIDGE_CONTROL, ctrl);
> >> +
> >> +	pci_udelay(500000);
> >
> > This is 0.5 second. This will add up quickly on larger servers with
> > multiple busses. Is 0.5 second required by the spec?
> > aer_do_secondary_bus_reset() holds PCI_BRIDGE_CTL_BUS_RESET for 2 ms and
> > then waits another 200 ms after de-asserting it. Still long, but less
> > than half of the delay in above code..
> 
> According to PCIe spec, they should be more than 1ms and 100ms. The reason
> why they're 2ms and 200ms in aer_do_secondary_bus_reset() is that it's a
> kind of safety margin, I think.
> 
> At first these delay were 2ms and 200ms as well as
> aer_do_secondary_bus_reset(), but I found a problem that on certain
> machine it was not enough. I think this problem occurs because
> native_io_delay() is not precise. Therefore I extended them to have more
> margin.
> 
> If this delay should be shortened, I have two solutions.
> 
> 1)
> Call early_reset_pcie_devices() after calibrate_delay() so that we can
> use precise mdelay. Personally I don't want to do this because I think
> DMA should be stopped asap.
> 
> 2)
> Make it tuneable. For exmple, improve "reset_devices"
>      reset_devices=reset_delay=500
> or something like that. As default, 200ms is enough as far as I tested.
> But if it is not enough, we can adjust delay time using this.
> 
> Any other idea?
> 

I don't like the idea of asking end user to determine how many msec of
delay they would need on their system for a reset. If we have to go that
route, I would rather have a default of 200 msec and then add a kernel
option like "reset_devices=reset_delay=long" which changes the delay to
500 msec.

Here is another idea. The code you currently have does:

for (each device) {
  save config registers
  reset
  wait for 500 ms
  restore config registers
}

You need to wait for 500 ms because you can not access config registers
until reset is complete. So how about this - why not just save config
registers, reset each device and then wait only once for 500 ms before
restoring config registers on all devices. Here is what this will look
like:

for (each device) {
  save config registers
  reset
}
wait 500 ms
for (each device) {
  restore config registers
}

This is obviously more complex and requires you to allocate more space
for saving state, but it has the benefits of minimizing boot up delay
and making the delay constant irrespective of number of switches/bridges
on the system.

Does this sound reasonable?


> >
> > We have been seeing problems with kexec/kdump kernel for quite some time
> > that are related to I/O devices not being quiesced before kexec. I had
> > added code to clear Bus Master bit to help stop runaway DMAs which
> > helped many cases, but obviously not all. If resetting downstream ports
> > helps stop runaway I/O from PCIe devices, I am all for this approach.
> > This patch still doesn't do anything for old PCI devices though.
> 
> This patch does not reset PCI devices because of VGA problem. As I wrote
> in patch description, the monitor blacks out when VGA controller is
> reset. So when VGA device is found we need to skip it. In the case of
> PCIe, we can decide whether we do bus reset or not on each device, but
> we cannot in the case of PCI.

OK. My concern is there are plenty of older servers out there that
potentially have the problem with kexec/kdump failing often and we are
not solving the problem for them. If we can only solve it for PCIe for
now, I am ok with starting there.

--
Khalid


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/