Short:
1. Calling pci_remove_bus_device() on some of the devices on a bus will
remove them. (ok)
2. Calling pci_scan_child_bus() on that bus will scan the bus for
devices and add them all. even the ones already in the bus device list,
which will now be duplicated in the bus device list! (no so ok)
3. Calling pci_bus_add_devices() Will try to add the device, assign
resource, and create procfs files. The devices which were not originally
removed are were being assigned resource they don't need and procfs
files already exist and will collide (Bug)
Background:
I have a programmable component which implements a pci device. Upon
warm update i need to remove my devices from the bus, upgrade, and add
them back.
I have bashed my head against pci.h API for the last week trying to make
this work with no success. Right now I have devised a hideous hack to
circumvent pci.h and directly remap the pci device registers.
demo code:
reload_my_hardware_design() {
struct pci_dev *dev = NULL;
struct pci_bus *bus = NULL;
while ((dev = pci_get_device(PCI_VENDOR_ID_MY_ID,PCI_ANY_ID,NULL))) {
pci_remove_bus_device(dev);
}
...
reload the hardware
...
bus = pci_find_bus(0,0); /* though hard coded in this example this
is the right bus*/
pci_scan_child_bus(bus);
pci_bus_add_devices(bus);
}
Is this a bug or am I doing something wrong?
Liberty
Dear Penguins,
As a follow up on my own post ( http://lkml.org/lkml/2008/6/18/195 ), I
have solved a small bug/misbehavior within the pci_scan_single_device().
The misbehavior manifest itself upon calling pci_scan_child_bus() with a
bus which already contain some devices. After scanning is done devices
that already existed will be present twice in the PCI bus devise list.
Once with is_added indication and once without. Trying to add this bus
will cause a resource conflict as the same device is already present and
initialized.
This patch will simply prevent a device to be added to a bus list if it
is already there.
Points to consider:
1. I am not a PCI Guru and might have over looked the bigger picture. Is
it OK to prevent a device of being re-added?
2. I have decided that two devices are, in fact, the same instance if it
has the same: vendor, device, and devfn. Is there a finer test for a
device identity?
p.s. As per http://www.kernel.org/pub/linux/docs/lkml/#s1-10 it is
preferred for Mozilla Firefox clients to attach the patch, hence my
patch is attached.
Liberty
Signed-off-by: Eran Liberty <[email protected]>
---
The subject is wrong! it is 2.6.26-rc4.
I will re-email with a corrected subject for automated scripts sake.
Liberty
Eran Liberty wrote:
> Dear Penguins,
>
> As a follow up on my own post ( http://lkml.org/lkml/2008/6/18/195 ),
> I have solved a small bug/misbehavior within the
> pci_scan_single_device().
>
> The misbehavior manifest itself upon calling pci_scan_child_bus() with
> a bus which already contain some devices. After scanning is done
> devices that already existed will be present twice in the PCI bus
> devise list. Once with is_added indication and once without. Trying to
> add this bus will cause a resource conflict as the same device is
> already present and initialized.
>
> This patch will simply prevent a device to be added to a bus list if
> it is already there.
>
> Points to consider:
> 1. I am not a PCI Guru and might have over looked the bigger picture.
> Is it OK to prevent a device of being re-added?
> 2. I have decided that two devices are, in fact, the same instance if
> it has the same: vendor, device, and devfn. Is there a finer test for
> a device identity?
>
> p.s. As per http://www.kernel.org/pub/linux/docs/lkml/#s1-10 it is
> preferred for Mozilla Firefox clients to attach the patch, hence my
> patch is attached.
>
> Liberty
>
> Signed-off-by: Eran Liberty <[email protected]>
> ---
Dear Penguins,
As a follow up on my own post ( http://lkml.org/lkml/2008/6/18/195 ), I
have solved a small bug/misbehavior within the pci_scan_single_device().
The misbehavior manifest itself upon calling pci_scan_child_bus() with a
bus which already contain some devices. After scanning is done devices
that already existed will be present twice in the PCI bus devise list.
Once with is_added indication and once without. Trying to add this bus
will cause a resource conflict as the same device is already present and
initialized.
This patch will simply prevent a device to be added to a pci bus list if
it is already there.
Points to consider:
1. I am not a PCI Guru and might have over looked the bigger picture. Is
it OK to prevent a device of being re-added?
2. I have decided that two devices are, in fact, the same instance if it
has the same: vendor, device, and devfn. Is there a finer test for a
device identity?
p.s. As per http://www.kernel.org/pub/linux/docs/lkml/#s1-10 it is
preferred for Mozilla Firefox clients to attach the patch, hence my
patch is attached.
Liberty
Signed-off-by: Eran Liberty <[email protected]>
---
On Mon, Jul 21, 2008 at 10:18:19PM +0300, Eran Liberty wrote:
> Dear Penguins,
>
> As a follow up on my own post ( http://lkml.org/lkml/2008/6/18/195 ), I
> have solved a small bug/misbehavior within the pci_scan_single_device().
To quote your original post here for ease of replying:
> 1. Calling pci_remove_bus_device() on some of the devices on a bus will
> remove them. (ok)
> 2. Calling pci_scan_child_bus() on that bus will scan the bus for
> devices and add them all. even the ones already in the bus device list,
> which will now be duplicated in the bus device list! (no so ok)
I think this is your real problem, that you're rescanning the entire
bus. I don't think that's the route we'd recommend taking.
Why don't you call pci_scan_slot() instead? You won't get the benefit of
pcibios_fixup_bus(), but I'm not convinced that's safe to call on a bus
that's already been scanned.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
> I think this is your real problem, that you're rescanning the entire
> bus. I don't think that's the route we'd recommend taking.
>
My stating point was that I have loaded a new design into a
programmable device which sits on the pci device. The new design can
implement numerous pci devices or non at all. I can think of an easy
way (or clean one) to scan only the programmable device. Scanning the
whole bus seemed reasonable.
> Why don't you call pci_scan_slot() instead? You won't get the benefit of
> pcibios_fixup_bus(), but I'm not convinced that's safe to call on a bus
> that's already been scanned.
>
As said its not exactly a slot its more like a regular pci device that
someone suddenly welded into the pci bus. Its not a hotplug as well,
and I do not want to give up on the pcibios_fixup_bus()
As it is, with my patch applied i successfully go over the bus and
remove my own devices before I reprogram the
programmable device.
while ((dev = pci_get_device(PCI_VENDOR_ID_MYCOMP,PCI_DEVICE_ID_MYDEV,NULL))
!= NULL) {
pci_remove_bus_device(dev);
pci_dev_put(dev);
}
Load a new design into it.
Then scan the entire bus and add the newly discovered devices.
bus = null;
while ((bus = pci_find_next_bus(bus)) != NULL) {
pci_scan_child_bus(bus);
pci_bus_assign_resources(bus);
pci_bus_add_devices(bus);
}
As seen here, this sequence of instructions seems very intuitive. It
will fail without the patch upon pci_bus_add_devices().
Liberty
On Tue, Jul 22, 2008 at 11:21:06AM +0300, eran liberty wrote:
> > I think this is your real problem, that you're rescanning the entire
> > bus. I don't think that's the route we'd recommend taking.
>
> My stating point was that I have loaded a new design into a
> programmable device which sits on the pci device. The new design can
> implement numerous pci devices or non at all. I can think of an easy
> way (or clean one) to scan only the programmable device. Scanning the
> whole bus seemed reasonable.
That's what pci_scan_slot() is for. It scans the first function at the
device number, then (if the header indicates it's a multifunction
device) scans the other functions associated with that device. eg you
could call pci_scan_slot(bus, 0x30) and it will create function 06.0
(and potentially 06.1, 06.2, ...)
You presumably already have the devfn for the existing device since
you're able to call pci_remove_bus_device().
> > Why don't you call pci_scan_slot() instead? You won't get the benefit of
> > pcibios_fixup_bus(), but I'm not convinced that's safe to call on a bus
> > that's already been scanned.
>
> As said its not exactly a slot its more like a regular pci device that
> someone suddenly welded into the pci bus. Its not a hotplug as well,
> and I do not want to give up on the pcibios_fixup_bus()
Why not? What architecture are you using? What does
pcibios_fixup_bus() do for you?
(as a side-note, I'd like to reimplement the pcibios_fixup_*() routines;
I think a lot of what they do can be done more generically these days.
It'll take a while and isn't high on my priority list).
> As it is, with my patch applied i successfully go over the bus and
> remove my own devices before I reprogram the
> programmable device.
>
> while ((dev = pci_get_device(PCI_VENDOR_ID_MYCOMP,PCI_DEVICE_ID_MYDEV,NULL))
> != NULL) {
> pci_remove_bus_device(dev);
> pci_dev_put(dev);
> }
>
> Load a new design into it.
>
> Then scan the entire bus and add the newly discovered devices.
>
> bus = null;
> while ((bus = pci_find_next_bus(bus)) != NULL) {
> pci_scan_child_bus(bus);
> pci_bus_assign_resources(bus);
> pci_bus_add_devices(bus);
> }
>
> As seen here, this sequence of instructions seems very intuitive. It
> will fail without the patch upon pci_bus_add_devices().
Seems utterly unintuitive to me. You're doing a lot of unnecessary work
here, and if you have two cards in your machine, you'll take away both
of them when you reload either of them.
What you should do is cache the pci_bus and the devfn at startup:
static struct pci_bus *my_bus;
static int my_devfn;
struct pci_dev *dev = pci_get_device(PCI_VENDOR_ID_MYCOMP,
PCI_DEVICE_ID_MYDEV, NULL);
if (!dev)
return -ENODEV;
my_bus = dev->bus;
my_devfn = dev->devfn;
pci_dev_put(dev);
when you want to remove it:
for (func = 0; func < 8; func++)
struct pci_dev *dev = pci_get_slot(my_bus, my_devfn + func);
if (!dev)
continue;
pci_remove_bus_device(dev);
pci_dev_put(dev);
}
when you want to rescan it:
pci_scan_slot(my_bus, my_devfn);
(this only handles one programmable card. The basic idea could be
extended to handle multiple cards if you need to do that).
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
Matthew,
You seem to have a finer grasp of the subject then I do, please correct/educate me on any of the points I raise in the following lines.
Matthew Wilcox wrote:
> On Tue, Jul 22, 2008 at 11:21:06AM +0300, eran liberty wrote:
>>> I think this is your real problem, that you're rescanning the entire
>>> bus. I don't think that's the route we'd recommend taking.
>> My stating point was that I have loaded a new design into a
>> programmable device which sits on the pci device. The new design can
>> implement numerous pci devices or non at all. I can think of an easy
>> way (or clean one) to scan only the programmable device. Scanning the
>> whole bus seemed reasonable.
> That's what pci_scan_slot() is for. It scans the first function at the
> device number, then (if the header indicates it's a multifunction
> device) scans the other functions associated with that device. eg you
> could call pci_scan_slot(bus, 0x30) and it will create function 06.0
> (and potentially 06.1, 06.2, ...)
> You presumably already have the devfn for the existing device since
> you're able to call pci_remove_bus_device().
Each slot represent a single device which can have more then one function. pci_scan_slot is aimed for scanning these multiple functions.
I, on the other hand, have programmable device on the pci bus which is, for the sake of this discussion, a complete black box.
This black box up on loading can implement more then one device, which can have more then one function each.
So as far as I see it, now I need to scan all slots on the bus.
But to be honest, upon looking a way to make my device work I dismissed the "pci_scan_slot()" option as It did not reach the "fixup_resource <http://liberty/lxr/ident?v=e500-linux-2.6.26-rc4;i=fixup_resource>()" part.
>>> Why don't you call pci_scan_slot() instead? You won't get the benefit of
>>> pcibios_fixup_bus(), but I'm not convinced that's safe to call on a bus
>>> that's already been scanned.
>>>
>> As said its not exactly a slot its more like a regular pci device that
>> someone suddenly welded into the pci bus. Its not a hotplug as well,
>> and I do not want to give up on the pcibios_fixup_bus()
>>
>
> Why not? What architecture are you using? What does
> pcibios_fixup_bus() do for you?
>
I work with ARCH=powerpc. pcibios_fixup_bus() will deal with all the resource bars allocation.
I needed Linux to renegotiate the resources bars on the PCI devices.
> (as a side-note, I'd like to reimplement the pcibios_fixup_*() routines;
> I think a lot of what they do can be done more generically these days.
> It'll take a while and isn't high on my priority list).
>
If I can lend a hand there, let me know and I will try to squeeze it in somewhere.
>
>> As it is, with my patch applied i successfully go over the bus and
>> remove my own devices before I reprogram the
>> programmable device.
>>
>> while ((dev = pci_get_device(PCI_VENDOR_ID_MYCOMP,PCI_DEVICE_ID_MYDEV,NULL))
>> != NULL) {
>> pci_remove_bus_device(dev);
>> pci_dev_put(dev);
>> }
>>
>> Load a new design into it.
>>
>> Then scan the entire bus and add the newly discovered devices.
>>
>> bus = null;
>> while ((bus = pci_find_next_bus(bus)) != NULL) {
>> pci_scan_child_bus(bus);
>> pci_bus_assign_resources(bus);
>> pci_bus_add_devices(bus);
>> }
>>
>> As seen here, this sequence of instructions seems very intuitive. It
>> will fail without the patch upon pci_bus_add_devices().
>>
>
> Seems utterly unintuitive to me. You're doing a lot of unnecessary work
> here, and if you have two cards in your machine, you'll take away both
> of them when you reload either of them.
>
Hmmm, I do want to remove all the devices that are implemented by the programmable unit which is reloaded.
I have not considered the possibility of having more then one programmable unit.
I guess that the removing part can be more fine tuned as the need arises.
> What you should do is cache the pci_bus and the devfn at startup:
>
> static struct pci_bus *my_bus;
> static int my_devfn;
>
> struct pci_dev *dev = pci_get_device(PCI_VENDOR_ID_MYCOMP,
> PCI_DEVICE_ID_MYDEV, NULL);
> if (!dev)
> return -ENODEV;
> my_bus = dev->bus;
> my_devfn = dev->devfn;
> pci_dev_put(dev);
>
> when you want to remove it:
>
> for (func = 0; func < 8; func++)
> struct pci_dev *dev = pci_get_slot(my_bus, my_devfn + func);
> if (!dev)
> continue;
> pci_remove_bus_device(dev);
> pci_dev_put(dev);
> }
>
> when you want to rescan it:
>
> pci_scan_slot(my_bus, my_devfn);
>
> (this only handles one programmable card. The basic idea could be
> extended to handle multiple cards if you need to do that).
>
I think there is a hidden assumption in this code, again please correct me if I missed the point.
This code assumes that the devices which will re-appear after the programmable unit is loaded has the same devfn and bus as the devices which were present before the reload.
This assumption might be wrong.
For example, I have a basic programmable image which has no pci devices at all.
upon unloading I do not remove any device (as non are present) and up on reloading I suddenly have two. What is their bus? their devfn?
Ultimately I would have expected to find a "int pci_scan_bus(struct <http://liberty/lxr/ident?v=e500-linux-2.6.26-rc4;i=pci_scan_bus>pci_bus <http://liberty/lxr/ident?v=e500-linux-2.6.26-rc4;i=pci_bus> *bus <http://liberty/lxr/ident?v=e500-linux-2.6.26-rc4;i=bus>);" the "pci_scan_child_bus <http://liberty/lxr/ident?v=e500-linux-2.6.26-rc4;i=pci_scan_child_bus>()" was the closest to the mark
Liberty
WOW, sorry for the html tags (though I had this Thunderbird under
control). Here is the same mail without html tags (hopeully)
Matthew,
You seem to have a finer grasp of the subject then I do, please
correct/educate me on any of the points I raise in the following lines.
Matthew Wilcox wrote:
> On Tue, Jul 22, 2008 at 11:21:06AM +0300, eran liberty wrote:
>>> I think this is your real problem, that you're rescanning the entire
>>> bus. I don't think that's the route we'd recommend taking.
>> My stating point was that I have loaded a new design into a
>> programmable device which sits on the pci device. The new design can
>> implement numerous pci devices or non at all. I can think of an easy
>> way (or clean one) to scan only the programmable device. Scanning the
>> whole bus seemed reasonable.
> That's what pci_scan_slot() is for. It scans the first function at the
> device number, then (if the header indicates it's a multifunction
> device) scans the other functions associated with that device. eg you
> could call pci_scan_slot(bus, 0x30) and it will create function 06.0
> (and potentially 06.1, 06.2, ...)
> You presumably already have the devfn for the existing device since
> you're able to call pci_remove_bus_device().
Each slot represent a single device which can have more then one
function. pci_scan_slot is aimed for scanning these multiple functions.
I, on the other hand, have programmable device on the pci bus which is,
for the sake of this discussion, a complete black box.
This black box up on loading can implement more then one device, which
can have more then one function each.
So as far as I see it, now I need to scan all slots on the bus.
But to be honest, upon looking a way to make my device work I dismissed
the "pci_scan_slot()" option as It did not reach the "fixup_resource ()"
part.
>>> Why don't you call pci_scan_slot() instead? You won't get the benefit of
>>> pcibios_fixup_bus(), but I'm not convinced that's safe to call on a bus
>>> that's already been scanned.
>>>
>> As said its not exactly a slot its more like a regular pci device that
>> someone suddenly welded into the pci bus. Its not a hotplug as well,
>> and I do not want to give up on the pcibios_fixup_bus()
>>
>
> Why not? What architecture are you using? What does
> pcibios_fixup_bus() do for you?
>
I work with ARCH=powerpc. pcibios_fixup_bus() will deal with all the
resource bars allocation.
I needed Linux to renegotiate the resources bars on the PCI devices.
> (as a side-note, I'd like to reimplement the pcibios_fixup_*() routines;
> I think a lot of what they do can be done more generically these days.
> It'll take a while and isn't high on my priority list).
>
If I can lend a hand there, let me know and I will try to squeeze it in
somewhere.
>
>> As it is, with my patch applied i successfully go over the bus and
>> remove my own devices before I reprogram the
>> programmable device.
>>
>> while ((dev = pci_get_device(PCI_VENDOR_ID_MYCOMP,PCI_DEVICE_ID_MYDEV,NULL))
>> != NULL) {
>> pci_remove_bus_device(dev);
>> pci_dev_put(dev);
>> }
>>
>> Load a new design into it.
>>
>> Then scan the entire bus and add the newly discovered devices.
>>
>> bus = null;
>> while ((bus = pci_find_next_bus(bus)) != NULL) {
>> pci_scan_child_bus(bus);
>> pci_bus_assign_resources(bus);
>> pci_bus_add_devices(bus);
>> }
>>
>> As seen here, this sequence of instructions seems very intuitive. It
>> will fail without the patch upon pci_bus_add_devices().
>>
>
> Seems utterly unintuitive to me. You're doing a lot of unnecessary work
> here, and if you have two cards in your machine, you'll take away both
> of them when you reload either of them.
>
Hmmm, I do want to remove all the devices that are implemented by the
programmable unit which is reloaded.
I have not considered the possibility of having more then one
programmable unit.
I guess that the removing part can be more fine tuned as the need arises.
> What you should do is cache the pci_bus and the devfn at startup:
>
> static struct pci_bus *my_bus;
> static int my_devfn;
>
> struct pci_dev *dev = pci_get_device(PCI_VENDOR_ID_MYCOMP,
> PCI_DEVICE_ID_MYDEV, NULL);
> if (!dev)
> return -ENODEV;
> my_bus = dev->bus;
> my_devfn = dev->devfn;
> pci_dev_put(dev);
>
> when you want to remove it:
>
> for (func = 0; func < 8; func++)
> struct pci_dev *dev = pci_get_slot(my_bus, my_devfn + func);
> if (!dev)
> continue;
> pci_remove_bus_device(dev);
> pci_dev_put(dev);
> }
>
> when you want to rescan it:
>
> pci_scan_slot(my_bus, my_devfn);
>
> (this only handles one programmable card. The basic idea could be
> extended to handle multiple cards if you need to do that).
>
I think there is a hidden assumption in this code, again please correct
me if I missed the point.
This code assumes that the devices which will re-appear after the
programmable unit is loaded has the same devfn and bus as the devices
which were present before the reload.
This assumption might be wrong.
For example, I have a basic programmable image which has no pci devices
at all.
upon unloading I do not remove any device (as non are present) and up on
reloading I suddenly have two. What is their bus? their devfn?
Ultimately I would have expected to find a "int pci_scan_bus(struct
pci_bus *bus );" the "pci_scan_child_bus ()" was the closest to the mark
Liberty
On Tue, Jul 22, 2008 at 04:14:41PM +0300, Eran Liberty wrote:
> Each slot represent a single device which can have more then one
> function. pci_scan_slot is aimed for scanning these multiple functions.
Exactly.
> I, on the other hand, have programmable device on the pci bus which is,
> for the sake of this discussion, a complete black box.
> This black box up on loading can implement more then one device, which
> can have more then one function each.
> So as far as I see it, now I need to scan all slots on the bus.
That's quite interesting. A normal PCI device has its IDSEL pin
connected to one of the AD lines, so it's not possible for it to respond
to more then one of the 32 possible devices. I suppose by decoding
the configuration cycles on the bus, it's possible to make a
programmable device respond to more than one device. I suppose you have
to be careful to program it to not respond to the same device numbers as
any other device on the bus. Or are there no other devices on this bus
(in which case I have another solution for you)?
> But to be honest, upon looking a way to make my device work I dismissed
> the "pci_scan_slot()" option as It did not reach the "fixup_resource ()"
> part.
[...]
> I work with ARCH=powerpc. pcibios_fixup_bus() will deal with all the
> resource bars allocation.
> I needed Linux to renegotiate the resources bars on the PCI devices.
OK. I don't think it's safe to call powerpc's fixup_resource() more
than once for a given resource.
> >(as a side-note, I'd like to reimplement the pcibios_fixup_*() routines;
> >I think a lot of what they do can be done more generically these days.
> >It'll take a while and isn't high on my priority list).
> >
> If I can lend a hand there, let me know and I will try to squeeze it in
> somewhere.
What I want to do is get rid of fixup_resource() and its equivalent on
other architectures and call pcibios_bus_to_resource() from
drivers/pci/probe.c.
I've been working in this area recently; here's my current tree:
http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=pci-read-base
I've had a quick hack at doing what I want, but I haven't tested it.
> I think there is a hidden assumption in this code, again please correct
> me if I missed the point.
> This code assumes that the devices which will re-appear after the
> programmable unit is loaded has the same devfn and bus as the devices
> which were present before the reload.
> This assumption might be wrong.
Yes, I did make that assumption ... I'm used to real PCI devices, not
this unusual box ;-)
> For example, I have a basic programmable image which has no pci devices
> at all.
> upon unloading I do not remove any device (as non are present) and up on
> reloading I suddenly have two. What is their bus? their devfn?
>
> Ultimately I would have expected to find a "int pci_scan_bus(struct
> pci_bus *bus );" the "pci_scan_child_bus ()" was the closest to the mark
If this is all alone on a bus of its own, we can certainly handle that.
Maybe we can make it a module parameter or something.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
Matthew Wilcox wrote:
> If this is all alone on a bus of its own, we can certainly handle that.
> Maybe we can make it a module parameter or something.
>
If my devices where the only one on the bus I would not have stumbled on
this bug. I would have removed them all before reloading and would have
scanned and empty PCI bus list after the reloading.
Alas, the CPU itself is a pci device. Upon scanning the bus it is it who
is duplicated. Upon trying to "pci_bus_add_devices()" you get resource
collision.
Do you perceive any down side to my suggested patch?
It seems (to me) really logical that a device can not and should not be
added twice to a pci bus device list.
Liberty
On Tue, Jul 22, 2008 at 06:25:40PM +0300, Eran Liberty wrote:
> If my devices where the only one on the bus I would not have stumbled on
> this bug. I would have removed them all before reloading and would have
> scanned and empty PCI bus list after the reloading.
That's not true, the code you posted:
bus = null;
while ((bus = pci_find_next_bus(bus)) != NULL) {
pci_scan_child_bus(bus);
pci_bus_assign_resources(bus);
pci_bus_add_devices(bus);
}
will scan all busses again.
> Alas, the CPU itself is a pci device. Upon scanning the bus it is it who
> is duplicated. Upon trying to "pci_bus_add_devices()" you get resource
> collision.
Does the CPU have BARs of its own? If you have no other devices with
BARs, it would explain why you have not noticed the problem with
fixup_resource ;-)
> Do you perceive any down side to my suggested patch?
> It seems (to me) really logical that a device can not and should not be
> added twice to a pci bus device list.
The problem is the other side-effects of what you propose.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
Matthew Wilcox wrote:
> On Tue, Jul 22, 2008 at 06:25:40PM +0300, Eran Liberty wrote:
>
>> If my devices where the only one on the bus I would not have stumbled on
>> this bug. I would have removed them all before reloading and would have
>> scanned and empty PCI bus list after the reloading.
>>
>
> That's not true, the code you posted:
>
> bus = null;
> while ((bus = pci_find_next_bus(bus)) != NULL) {
> pci_scan_child_bus(bus);
> pci_bus_assign_resources(bus);
> pci_bus_add_devices(bus);
> }
>
> will scan all busses again.
>
You are correct! I have a single bus :)
>
>> Alas, the CPU itself is a pci device. Upon scanning the bus it is it who
>> is duplicated. Upon trying to "pci_bus_add_devices()" you get resource
>> collision.
>>
>
> Does the CPU have BARs of its own? If you have no other devices with
> BARs, it would explain why you have not noticed the problem with
> fixup_resource ;-)
>
I seriously doubt my CPU has any resources of its own.
>
>> Do you perceive any down side to my suggested patch?
>> It seems (to me) really logical that a device can not and should not be
>> added twice to a pci bus device list.
>>
>
> The problem is the other side-effects of what you propose.
>
Could you elaborate what ominous side-effects you predict?
Liberty
On Tue, Jul 22, 2008 at 08:41:00PM +0300, Eran Liberty wrote:
> >>Do you perceive any down side to my suggested patch?
> >>It seems (to me) really logical that a device can not and should not be
> >>added twice to a pci bus device list.
> >>
> >
> >The problem is the other side-effects of what you propose.
> >
> Could you elaborate what ominous side-effects you predict?
Look at the consequences of calling fixup_resource() twice on the same
resource:
res->start = (res->start + offset) & mask;
res->end = (res->end + offset) & mask;
You'll add 'offset' to the other resources twice.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
Matthew Wilcox wrote:
> Look at the consequences of calling fixup_resource() twice on the same
> resource:
>
> res->start = (res->start + offset) & mask;
> res->end = (res->end + offset) & mask;
>
> You'll add 'offset' to the other resources twice.
>
Tring to find heads and tails in the code, I dug in. This is what I understand from the overall flow... and my points of interests:
1. pci_scan_child_bus() starts with iterating over all the devfn and scanning for a device with that devfn
drivers/pci/probe.c
1056 unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus)
1057 {
1063 /* Go find them, Rover! */
1064 for (devfn = 0; devfn < 0x100; devfn += 8)
1065 pci_scan_slot(bus, devfn);
2. pci_scan_slot() will continue to scan all the functions that devfn might have to over
drivers/pci/probe.c
1019 int pci_scan_slot(struct pci_bus *bus, int devfn)
1020 {
1026 for (func = 0; func < 8; func++, devfn++) {
1029 dev = pci_scan_single_device(bus, devfn);
3. pci_scan_single_device() will scan / test if this dev is valid.
drivers/pci/probe.c
996 struct pci_dev *__ref pci_scan_single_device(struct pci_bus *bus, int devfn)
997 {
998 struct pci_dev *dev;
999
1000 dev = pci_scan_device(bus, devfn);
and add it. REGARDLESS if it is already on the pci bus list or not.
1001 if (!dev)
1002 return NULL;
1003
1004 pci_device_add(dev, bus);
1005
1006 return dev;
1007 }
This is the first thing that needs to be fixed IMHO.
4. pci_scan_child_bus() will go on to fixup the bus whether it needs fixing or not.
drivers/pci/probe.c
1072 pcibios_fixup_bus(bus);
This might be the second thing that needs amending.
5. pcibios_fixup_bus(bus) calls _pcibios_fixup_bus which will call fixup_resource() once per a bus resource (in contrast of per device on that bus) and fixes the resources of that bus.
arch/powerpc/kernel/pci-common.c
784 static void __devinit __pcibios_fixup_bus(struct pci_bus *bus)
785 {
787 struct pci_dev *dev = bus->self;
798 for (i = 0; i < PCI_BUS_NUM_RESOURCES; ++i) {
833 fixup_resource(res, dev);
834 }
835 }
850 }
As you have correctly observed without other changes this would cause trouble on a bus that has resources.
Since i am not pulling the plug on the whole bus just on selected devices I can defensively skip this part
6. Now I should be able to call pci_bus_assign_resources(bus) which will go over all the newly added device and assign resource to them.
7. Last step I should be able to pci_bus_add_devices(bus) the, now, resource fixed devices.
Steps 6 and 7 are the one I miss most over which my device wont work after being re-born.
Soooo ...
If I really wanted to make it, working my way around the current code, I would have done something like this.
bus = null;
while ((bus = pci_find_next_bus(bus)) != NULL) {
for (devfn = 0; devfn < 0x100; devfn += 8) {
for (func = 0; func < 8; func++) {
struct pci_dev *dev = <http://liberty/lxr/ident?v=e500-linux-2.6.26-rc4;i=pci_get_slot>pci_get_slot(struct pci_bus *bus, unsigned int devfn);
if (dev)
continue;
dev = pci_scan_single_device(bus, devfn);
if (!dev)
continue;
pci_device_add(dev, bus);
}
}
pci_bus_assign_resources(bus);
pci_bus_add_devices(bus);
}
WOW.... first time for me to code in Mozilla Thunderbird.
Had I was this smart to begin with I would have done exactly that and go home to sleep like a decent person.
But since we are here, I feel there should be a either correction in the existing code to allow the:
pci_scan_child_bus(bus) -> pci_bus_assign_resources(bus) -> pci_bus_add_devices(bus) sequence.
Or a new helper function to the PCI that does more or less what I have just described.
If you want I can code it, patch it, and rip the glory :)
That was a long one :)
Liberty
Eran Liberty wrote:
> Soooo ...
>
> If I really wanted to make it, working my way around the current code, I would have done something like this.
>
> bus = null;
>
> while ((bus = pci_find_next_bus(bus)) != NULL) {
>
> for (devfn = 0; devfn < 0x100; devfn += 8) {
>
> for (func = 0; func < 8; func++) {
>
> struct pci_dev *dev = <http://liberty/lxr/ident?v=e500-linux-2.6.26-rc4;i=pci_get_slot>pci_get_slot(struct pci_bus *bus, unsigned int devfn);
>
> if (dev)
> continue;
>
> dev = pci_scan_single_device(bus, devfn);
>
> if (!dev)
>
> continue;
>
> pci_device_add(dev, bus);
>
> }
>
> }
>
> pci_bus_assign_resources(bus);
>
> pci_bus_add_devices(bus);
>
> }
>
Found some time to test my own idea... it seems to work. :)
Its final version looks like this:
Before reloading the programmable unit, remove all the device implemented by it
struct pci_dev *dev;
while ((dev = pci_get_device(PCI_VENDOR_ID_MYCOMP,PCI_DEVICE_ID_MYCOMP_MYDEV,NULL)) != NULL) {
pci_remove_bus_device(dev);
pci_dev_put(dev);
}
After the programmable unit is loaded, scan and add only newly added device!
struct pci_bus *bus = NULL;
while ((bus = pci_find_next_bus(bus)) != NULL) {
int devfn;
for (devfn = 0; devfn < 0x100; devfn++) {
struct pci_dev *dev=pci_get_slot(bus,devfn);
if (dev)
continue; // <---- do not scan devices already present on the bus. Missing magic line :)
(void)pci_scan_single_device(bus, devfn);
}
pci_bus_assign_resources(bus);
pci_bus_add_devices(bus);
}
So... no need for patch after all.
Unless... Hey whats this... behind it... on the side.... naaa, no need of patch... :(
Liberty
On Sun, Jul 27, 2008 at 02:01:04PM +0300, Eran Liberty wrote:
> Found some time to test my own idea... it seems to work. :)
Is there a reason you haven't tested the tree I posted?
: I've been working in this area recently; here's my current tree:
: http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=pci-read-base
: I've had a quick hack at doing what I want, but I haven't tested it.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
After compiling a kernel with ftrace I started to experience all sorts
of crashes.
I have a powerpc env which closly resemble mpc8548cds_defconfig.
I have produced the crash in seconds by issuing:
> while [ 1 ] ; do find / > /dev/null ; echo . ; done
Liberty
here are some of the kernel crashes:
---
~ # find /proc/ -name kft
find: /proc/150/net/softnet_stat: No such file or direOops: Exception in
kernel mode, sig: 11 [#1]
Exsw1600
Modules linked in:
NIP: c00bd6fc LR: c00bd6fc CTR: 00000000
REGS: ddfc3d50 TRAP: 0700 Not tainted (2.6.27-rc2)
MSR: 00029000 <EE,ME> CR: 20002284 XER: 20000000
TASK = ddcccde0[1699] 'find' THREAD: ddfc2000
GPR00: 00000000 ddfc3e00 ddcccde0 dd895580 ddfc3e30 c00b5da0 ddfc3e90
00000003
GPR08: c00ece54 00000370 00145505 00000003 80002282 100ad874 10083ca0
10083cd8
GPR16: 10083cd0 00000008 00000000 c00b5da0 ddfc3f18 c00ece54 00000000
df465720
GPR24: d7431900 ddfc3e30 dd895580 c0380000 dd895580 ddfc3e30 00000000
ddfc3e00
NIP [c00bd6fc] d_lookup+0x40/0x90
LR [c00bd6fc] d_lookup+0x40/0x90
Call Trace:
[ddfc3e00] [c006ade4] rcu_process_callbacks+0x48/0x60 (unreliable)
[ddfc3e20] [c00eb514] proc_fill_cache+0x94/0x1b0
[ddfc3e80] [c00ef720] proc_task_readdir+0x294/0x3c4
[ddfc3ee0] [c00b60fc] vfs_readdir+0xb8/0xd8
[ddfc3f10] [c00b619c] sys_getdents64+0x80/0x104
[ddfc3f40] [c0010554] ret_from_syscall+0x0/0x3c
Instruction dump:
7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0
73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b52a0
ctory
---[ end trace 6b059a7f5ba5a55a ]---
---
Oops: Exception in kernel mode, sig: 11 [#1]
Exsw1600
Modules linked in: gplonly_api
NIP: c00bd6fc LR: c00bd6fc CTR: 00000000
REGS: df52dc40 TRAP: 0700 Not tainted (2.6.27-rc2)
MSR: 00029000 <EE,ME> CR: 20082422 XER: 20000000
TASK = ddc2e060[2294] 'eeprom' THREAD: df52c000
GPR00: 00000000 df52dcf0 ddc2e060 dd82c700 df52dd58 df52dd48 dd82c75e
00000000
GPR08: c0800000 000311f0 0000ffff df52dd10 20002488 1001e75c 100b0000
1008e90c
GPR16: bfd15a50 df52de6c df52dd58 fffffff4 c0380000 df52dd50 df52dd48
dd9303fc
GPR24: dc3a6800 dd930390 df52dd58 c0380000 dd82c700 df52dd58 00000006
df52dcf0
NIP [c00bd6fc] d_lookup+0x40/0x90
LR [c00bd6fc] d_lookup+0x40/0x90
Call Trace:
[df52dcf0] [df52dd48] 0xdf52dd48 (unreliable)
[df52dd10] [c00b07a0] do_lookup+0xe8/0x220
[df52dd40] [c00b222c] __link_path_walk+0x174/0xd54
[df52ddb0] [c00b2e64] path_walk+0x58/0xe0
[df52dde0] [c00b2fd4] do_path_lookup+0x78/0x13c
[df52de10] [c00b3f3c] __path_lookup_intent_open+0x68/0xdc
[df52de40] [c00b3fd8] path_lookup_open+0x28/0x40
[df52de50] [c00b4234] do_filp_open+0xa0/0x798
[df52df00] [c00a3b80] do_sys_open+0x70/0x114
[df52df30] [c00a3c98] sys_open+0x38/0x50
[df52df40] [c0010554] ret_from_syscall+0x0/0x3c
Instruction dump:
7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0
73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b52a0
om_version(Canno---[ end trace 0cef3c8cb49d09de ]---
---
Oops: Exception in kernel mode, sig: 11 [#1]
Exsw1600
Modules linked in: bridge stp llc tun beacon_freescale(P) 80211(P)
shm_freescale(P) ath_remote_regs_freescale(P) exsw
load_local_fpga_freescale 8021q poe_fi
NIP: c00bd6fc LR: c00bd6fc CTR: 00000000
REGS: dd555c50 TRAP: 0700 Tainted: P (2.6.27-rc2)
MSR: 00029000 <EE,ME> CR: 24082282 XER: 20000000
TASK = df4669a0[4976] 'find' THREAD: dd554000
GPR00: 00000000 dd555d00 df4669a0 dd82ae00 dd555d68 dd555d58 dd82ae5b
100234ec
GPR08: c0800000 0002dd0c 0000ffff dd555d20 24000288 100ad874 100936f8
1008a1d0
GPR16: 10083f80 dd555e2c dd555d68 fffffff4 c0380000 dd555d60 dd555d58
dd9565d4
GPR24: dc3db480 dd956568 dd555d68 c0380000 dd82ae00 dd555d68 00000000
dd555d00
NIP [c00bd6fc] d_lookup+0x40/0x90
LR [c00bd6fc] d_lookup+0x40/0x90
Call Trace:
[dd555d00] [dd555d58] 0xdd555d58 (unreliable)
[dd555d20] [c00b07a0] do_lookup+0xe8/0x220
[dd555d50] [c00b265c] __link_path_walk+0x5a4/0xd54
[dd555dc0] [c00b2e64] path_walk+0x58/0xe0
[dd555df0] [c00b2fd4] do_path_lookup+0x78/0x13c
[dd555e20] [c00b3cd0] user_path_at+0x64/0xac
[dd555e90] [c00aac04] vfs_lstat_fd+0x34/0x74
[dd555ec0] [c00aacd8] vfs_lstat+0x30/0x48
[dd555ed0] [c00aad20] sys_lstat64+0x30/0x5c
[dd555f40] [c0010554] ret_from_syscall+0x0/0x3c
Instruction dump:
7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0
73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b52a0
76/auxv
/proc/4---[ end trace 0deef0827ce9df4b ]---
---
Oops: Exception in kernel mode, sig: 11 [#1]
Exsw1600
Modules linked in: bridge stp llc tun beacon_freescale(P) 80211(P)
shm_freescale(P) ath_remote_regs_freescale(P) exsw
load_local_fpga_freescale 8021q poe_fi
NIP: c00bd6fc LR: c00bd6fc CTR: 00000000
REGS: d57e7c40 TRAP: 0700 Tainted: P (2.6.27-rc2)
MSR: 00029000 <EE,ME> CR: 24482482 XER: 20000000
TASK = db3cc940[2054] 'eapd' THREAD: d57e6000
GPR00: 00000000 d57e7cf0 db3cc940 dd82a500 d57e7d58 d57e7d48 dd82a55a
00000000
GPR08: c0800000 00016c64 0000ffff d57e7d10 24422488 1008f700 4802cd38
100763e8
GPR16: 4801d818 d57e7e6c d57e7d58 fffffff4 c0380000 d57e7d50 d57e7d48
dd956c74
GPR24: dc3db680 dd956c08 d57e7d58 c0380000 dd82a500 d57e7d58 00000000
d57e7cf0
NIP [c00bd6fc] d_lookup+0x40/0x90
LR [c00bd6fc] d_lookup+0x40/0x90
Call Trace:
[d57e7cf0] [d57e7d48] 0xd57e7d48 (unreliable)
[d57e7d10] [c00b07a0] do_lookup+0xe8/0x220
[d57e7d40] [c00b265c] __link_path_walk+0x5a4/0xd54
[d57e7db0] [c00b2e64] path_walk+0x58/0xe0
[d57e7de0] [c00b2fd4] do_path_lookup+0x78/0x13c
[d57e7e10] [c00b3f3c] __path_lookup_intent_open+0x68/0xdc
[d57e7e40] [c00b3fd8] path_lookup_open+0x28/0x40
[d57e7e50] [c00b4234] do_filp_open+0xa0/0x798
[d57e7f00] [c00a3b80] do_sys_open+0x70/0x114
[d57e7f30] [c00a3c98] sys_open+0x38/0x50
[d57e7f40] [c0010554] ret_from_syscall+0x0/0x3c
Instruction dump:
7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0
73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b52a0
---[ end trace ffb4a6e37d78370a ]---
---
Oops: Exception in kernel mode, sig: 11 [#1]
Exsw1600
Modules linked in: bridge stp llc tun beacon_freescale(P) 80211(P)
shm_freescale(P) ath_remote_regs_freescale(P) exsw
load_local_fpga_freescale 8021q poe_fi
NIP: c00bd6fc LR: c00bd6fc CTR: c00bd6c8
REGS: decd5cf0 TRAP: 0700 Tainted: P (2.6.27-rc2)
MSR: 00029000 <EE,ME> CR: 22228882 XER: 20000000
TASK = d9f5a9a0[6897] 'sh' THREAD: decd4000
GPR00: 00000000 decd5da0 d9f5a9a0 dd801180 decd5de8 00000000 00000004
fffffff9
GPR08: fffffffa 00000000 00d5ca28 decd5d90 2ccc1686 100ad874 1fffb200
00000000
GPR16: 007fff00 decd5ec4 00000008 c02fe298 00200200 00000000 c031aa34
decd5de8
GPR24: decd5df4 00001af2 dd507d00 c0380000 dd801180 decd5de8 00000000
decd5da0
NIP [c00bd6fc] d_lookup+0x40/0x90
LR [c00bd6fc] d_lookup+0x40/0x90
Call Trace:
[decd5dc0] [c00bd7d8] d_hash_and_lookup+0x8c/0xc4
[decd5de0] [c00ed728] proc_flush_task+0xb4/0x268
[decd5e40] [c0032894] release_task+0x6c/0x39c
[decd5e70] [c0033238] wait_consider_task+0x674/0x920
[decd5eb0] [c0033610] do_wait+0x12c/0x3d4
[decd5f20] [c0033954] sys_wait4+0x9c/0xe8
[decd5f40] [c0010554] ret_from_syscall+0x0/0x3c
Instruction dump:
7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0
73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b52a0
---[ end trace 00106ff4ec3b9c22 ]---
Eran Liberty wrote:
> After compiling a kernel with ftrace I started to experience all sorts
> of crashes.
Just to make sure...
ftrace enables markers too, and RCU has tracing with the markers. This
may not be the problem, but I just want to eliminate as many variables
as possible.
Could you disable ftrace, but keep the markers on too. Also, could you
enable ftrace again and turn on the FTRACE_STARTUP_TEST.
Thanks,
-- Steve
> I have a powerpc env which closly resemble mpc8548cds_defconfig.
>
> I have produced the crash in seconds by issuing:
> > while [ 1 ] ; do find / > /dev/null ; echo . ; done
>
> Liberty
>
> here are some of the kernel crashes:
> ---
> ~ # find /proc/ -name kft
> find: /proc/150/net/softnet_stat: No such file or direOops: Exception
> in kernel mode, sig: 11 [#1]
> Exsw1600
> Modules linked in:
> NIP: c00bd6fc LR: c00bd6fc CTR: 00000000
> REGS: ddfc3d50 TRAP: 0700 Not tainted (2.6.27-rc2)
> MSR: 00029000 <EE,ME> CR: 20002284 XER: 20000000
> TASK = ddcccde0[1699] 'find' THREAD: ddfc2000
> GPR00: 00000000 ddfc3e00 ddcccde0 dd895580 ddfc3e30 c00b5da0 ddfc3e90
> 00000003
> GPR08: c00ece54 00000370 00145505 00000003 80002282 100ad874 10083ca0
> 10083cd8
> GPR16: 10083cd0 00000008 00000000 c00b5da0 ddfc3f18 c00ece54 00000000
> df465720
> GPR24: d7431900 ddfc3e30 dd895580 c0380000 dd895580 ddfc3e30 00000000
> ddfc3e00
> NIP [c00bd6fc] d_lookup+0x40/0x90
> LR [c00bd6fc] d_lookup+0x40/0x90
> Call Trace:
> [ddfc3e00] [c006ade4] rcu_process_callbacks+0x48/0x60 (unreliable)
> [ddfc3e20] [c00eb514] proc_fill_cache+0x94/0x1b0
> [ddfc3e80] [c00ef720] proc_task_readdir+0x294/0x3c4
> [ddfc3ee0] [c00b60fc] vfs_readdir+0xb8/0xd8
> [ddfc3f10] [c00b619c] sys_getdents64+0x80/0x104
> [ddfc3f40] [c0010554] ret_from_syscall+0x0/0x3c
> Instruction dump:
> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0
> 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b52a0
> ctory
> ---[ end trace 6b059a7f5ba5a55a ]---
>
> ---
>
> Oops: Exception in kernel mode, sig: 11 [#1]
> Exsw1600
> Modules linked in: gplonly_api
> NIP: c00bd6fc LR: c00bd6fc CTR: 00000000
> REGS: df52dc40 TRAP: 0700 Not tainted (2.6.27-rc2)
> MSR: 00029000 <EE,ME> CR: 20082422 XER: 20000000
> TASK = ddc2e060[2294] 'eeprom' THREAD: df52c000
> GPR00: 00000000 df52dcf0 ddc2e060 dd82c700 df52dd58 df52dd48 dd82c75e
> 00000000
> GPR08: c0800000 000311f0 0000ffff df52dd10 20002488 1001e75c 100b0000
> 1008e90c
> GPR16: bfd15a50 df52de6c df52dd58 fffffff4 c0380000 df52dd50 df52dd48
> dd9303fc
> GPR24: dc3a6800 dd930390 df52dd58 c0380000 dd82c700 df52dd58 00000006
> df52dcf0
> NIP [c00bd6fc] d_lookup+0x40/0x90
> LR [c00bd6fc] d_lookup+0x40/0x90
> Call Trace:
> [df52dcf0] [df52dd48] 0xdf52dd48 (unreliable)
> [df52dd10] [c00b07a0] do_lookup+0xe8/0x220
> [df52dd40] [c00b222c] __link_path_walk+0x174/0xd54
> [df52ddb0] [c00b2e64] path_walk+0x58/0xe0
> [df52dde0] [c00b2fd4] do_path_lookup+0x78/0x13c
> [df52de10] [c00b3f3c] __path_lookup_intent_open+0x68/0xdc
> [df52de40] [c00b3fd8] path_lookup_open+0x28/0x40
> [df52de50] [c00b4234] do_filp_open+0xa0/0x798
> [df52df00] [c00a3b80] do_sys_open+0x70/0x114
> [df52df30] [c00a3c98] sys_open+0x38/0x50
> [df52df40] [c0010554] ret_from_syscall+0x0/0x3c
> Instruction dump:
> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0
> 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b52a0
> om_version(Canno---[ end trace 0cef3c8cb49d09de ]---
>
> ---
>
> Oops: Exception in kernel mode, sig: 11 [#1]
> Exsw1600
> Modules linked in: bridge stp llc tun beacon_freescale(P) 80211(P)
> shm_freescale(P) ath_remote_regs_freescale(P) exsw
> load_local_fpga_freescale 8021q poe_fi
> NIP: c00bd6fc LR: c00bd6fc CTR: 00000000
> REGS: dd555c50 TRAP: 0700 Tainted: P (2.6.27-rc2)
> MSR: 00029000 <EE,ME> CR: 24082282 XER: 20000000
> TASK = df4669a0[4976] 'find' THREAD: dd554000
> GPR00: 00000000 dd555d00 df4669a0 dd82ae00 dd555d68 dd555d58 dd82ae5b
> 100234ec
> GPR08: c0800000 0002dd0c 0000ffff dd555d20 24000288 100ad874 100936f8
> 1008a1d0
> GPR16: 10083f80 dd555e2c dd555d68 fffffff4 c0380000 dd555d60 dd555d58
> dd9565d4
> GPR24: dc3db480 dd956568 dd555d68 c0380000 dd82ae00 dd555d68 00000000
> dd555d00
> NIP [c00bd6fc] d_lookup+0x40/0x90
> LR [c00bd6fc] d_lookup+0x40/0x90
> Call Trace:
> [dd555d00] [dd555d58] 0xdd555d58 (unreliable)
> [dd555d20] [c00b07a0] do_lookup+0xe8/0x220
> [dd555d50] [c00b265c] __link_path_walk+0x5a4/0xd54
> [dd555dc0] [c00b2e64] path_walk+0x58/0xe0
> [dd555df0] [c00b2fd4] do_path_lookup+0x78/0x13c
> [dd555e20] [c00b3cd0] user_path_at+0x64/0xac
> [dd555e90] [c00aac04] vfs_lstat_fd+0x34/0x74
> [dd555ec0] [c00aacd8] vfs_lstat+0x30/0x48
> [dd555ed0] [c00aad20] sys_lstat64+0x30/0x5c
> [dd555f40] [c0010554] ret_from_syscall+0x0/0x3c
> Instruction dump:
> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0
> 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b52a0
> 76/auxv
> /proc/4---[ end trace 0deef0827ce9df4b ]---
>
> ---
> Oops: Exception in kernel mode, sig: 11 [#1]
> Exsw1600
> Modules linked in: bridge stp llc tun beacon_freescale(P) 80211(P)
> shm_freescale(P) ath_remote_regs_freescale(P) exsw
> load_local_fpga_freescale 8021q poe_fi
> NIP: c00bd6fc LR: c00bd6fc CTR: 00000000
> REGS: d57e7c40 TRAP: 0700 Tainted: P (2.6.27-rc2)
> MSR: 00029000 <EE,ME> CR: 24482482 XER: 20000000
> TASK = db3cc940[2054] 'eapd' THREAD: d57e6000
> GPR00: 00000000 d57e7cf0 db3cc940 dd82a500 d57e7d58 d57e7d48 dd82a55a
> 00000000
> GPR08: c0800000 00016c64 0000ffff d57e7d10 24422488 1008f700 4802cd38
> 100763e8
> GPR16: 4801d818 d57e7e6c d57e7d58 fffffff4 c0380000 d57e7d50 d57e7d48
> dd956c74
> GPR24: dc3db680 dd956c08 d57e7d58 c0380000 dd82a500 d57e7d58 00000000
> d57e7cf0
> NIP [c00bd6fc] d_lookup+0x40/0x90
> LR [c00bd6fc] d_lookup+0x40/0x90
> Call Trace:
> [d57e7cf0] [d57e7d48] 0xd57e7d48 (unreliable)
> [d57e7d10] [c00b07a0] do_lookup+0xe8/0x220
> [d57e7d40] [c00b265c] __link_path_walk+0x5a4/0xd54
> [d57e7db0] [c00b2e64] path_walk+0x58/0xe0
> [d57e7de0] [c00b2fd4] do_path_lookup+0x78/0x13c
> [d57e7e10] [c00b3f3c] __path_lookup_intent_open+0x68/0xdc
> [d57e7e40] [c00b3fd8] path_lookup_open+0x28/0x40
> [d57e7e50] [c00b4234] do_filp_open+0xa0/0x798
> [d57e7f00] [c00a3b80] do_sys_open+0x70/0x114
> [d57e7f30] [c00a3c98] sys_open+0x38/0x50
> [d57e7f40] [c0010554] ret_from_syscall+0x0/0x3c
> Instruction dump:
> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0
> 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b52a0
> ---[ end trace ffb4a6e37d78370a ]---
>
> ---
> Oops: Exception in kernel mode, sig: 11 [#1]
> Exsw1600
> Modules linked in: bridge stp llc tun beacon_freescale(P) 80211(P)
> shm_freescale(P) ath_remote_regs_freescale(P) exsw
> load_local_fpga_freescale 8021q poe_fi
> NIP: c00bd6fc LR: c00bd6fc CTR: c00bd6c8
> REGS: decd5cf0 TRAP: 0700 Tainted: P (2.6.27-rc2)
> MSR: 00029000 <EE,ME> CR: 22228882 XER: 20000000
> TASK = d9f5a9a0[6897] 'sh' THREAD: decd4000
> GPR00: 00000000 decd5da0 d9f5a9a0 dd801180 decd5de8 00000000 00000004
> fffffff9
> GPR08: fffffffa 00000000 00d5ca28 decd5d90 2ccc1686 100ad874 1fffb200
> 00000000
> GPR16: 007fff00 decd5ec4 00000008 c02fe298 00200200 00000000 c031aa34
> decd5de8
> GPR24: decd5df4 00001af2 dd507d00 c0380000 dd801180 decd5de8 00000000
> decd5da0
> NIP [c00bd6fc] d_lookup+0x40/0x90
> LR [c00bd6fc] d_lookup+0x40/0x90
> Call Trace:
> [decd5dc0] [c00bd7d8] d_hash_and_lookup+0x8c/0xc4
> [decd5de0] [c00ed728] proc_flush_task+0xb4/0x268
> [decd5e40] [c0032894] release_task+0x6c/0x39c
> [decd5e70] [c0033238] wait_consider_task+0x674/0x920
> [decd5eb0] [c0033610] do_wait+0x12c/0x3d4
> [decd5f20] [c0033954] sys_wait4+0x9c/0xe8
> [decd5f40] [c0010554] ret_from_syscall+0x0/0x3c
> Instruction dump:
> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0
> 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b52a0
> ---[ end trace 00106ff4ec3b9c22 ]---
>
>
>
>
* Steven Rostedt ([email protected]) wrote:
> Eran Liberty wrote:
>> After compiling a kernel with ftrace I started to experience all sorts of
>> crashes.
>
> Just to make sure...
>
> ftrace enables markers too, and RCU has tracing with the markers. This may
> not be the problem, but I just want to eliminate as many variables as
> possible.
> Could you disable ftrace, but keep the markers on too. Also, could you
> enable ftrace again and turn on the FTRACE_STARTUP_TEST.
>
Steve ? I just noticed this :
ftrace_modify_code(unsigned long ip, unsigned char *old_code,
unsigned char *new_code)
{
unsigned replaced;
unsigned old = *(unsigned *)old_code;
unsigned new = *(unsigned *)new_code;
int faulted = 0;
/*
* Note: Due to modules and __init, code can
* disappear and change, we need to protect against faulting
* as well as code changing.
*
* No real locking needed, this code is run through
* kstop_machine.
*/
asm volatile (
"1: lwz %1, 0(%2)\n"
" cmpw %1, %5\n"
" bne 2f\n"
" stwu %3, 0(%2)\n"
"2:\n"
".section .fixup, \"ax\"\n"
"3: li %0, 1\n"
" b 2b\n"
".previous\n"
".section __ex_table,\"a\"\n"
_ASM_ALIGN "\n"
_ASM_PTR "1b, 3b\n"
".previous"
: "=r"(faulted), "=r"(replaced)
: "r"(ip), "r"(new),
"0"(faulted), "r"(old)
: "memory");
if (replaced != old && replaced != new)
faulted = 2;
if (!faulted)
flush_icache_range(ip, ip + 8);
return faulted;
}
What happens if you are really unlucky and get the following execution
sequence ?
Load module A at address 0xddfc3e00
Populate ftrace function table while the system runs
Unload module A
Load module B at address 0xddfc3e00
Activate ftrace
-> At that point, since there is another module loaded in the same
address space, it won't fault. If there happens to be an instruction
which looks exactly like the instruction we are replacing at the same
location, we are introducing a bug. True both on x86 ans powerpc...
Mathieu
> Thanks,
>
> -- Steve
>
>> I have a powerpc env which closly resemble mpc8548cds_defconfig.
>>
>> I have produced the crash in seconds by issuing:
>> > while [ 1 ] ; do find / > /dev/null ; echo . ; done
>>
>> Liberty
>>
>> here are some of the kernel crashes:
>> ---
>> ~ # find /proc/ -name kft
>> find: /proc/150/net/softnet_stat: No such file or direOops: Exception in
>> kernel mode, sig: 11 [#1]
>> Exsw1600
>> Modules linked in:
>> NIP: c00bd6fc LR: c00bd6fc CTR: 00000000
>> REGS: ddfc3d50 TRAP: 0700 Not tainted (2.6.27-rc2)
>> MSR: 00029000 <EE,ME> CR: 20002284 XER: 20000000
>> TASK = ddcccde0[1699] 'find' THREAD: ddfc2000
>> GPR00: 00000000 ddfc3e00 ddcccde0 dd895580 ddfc3e30 c00b5da0 ddfc3e90
>> 00000003
>> GPR08: c00ece54 00000370 00145505 00000003 80002282 100ad874 10083ca0
>> 10083cd8
>> GPR16: 10083cd0 00000008 00000000 c00b5da0 ddfc3f18 c00ece54 00000000
>> df465720
>> GPR24: d7431900 ddfc3e30 dd895580 c0380000 dd895580 ddfc3e30 00000000
>> ddfc3e00
>> NIP [c00bd6fc] d_lookup+0x40/0x90
>> LR [c00bd6fc] d_lookup+0x40/0x90
>> Call Trace:
>> [ddfc3e00] [c006ade4] rcu_process_callbacks+0x48/0x60 (unreliable)
>> [ddfc3e20] [c00eb514] proc_fill_cache+0x94/0x1b0
>> [ddfc3e80] [c00ef720] proc_task_readdir+0x294/0x3c4
>> [ddfc3ee0] [c00b60fc] vfs_readdir+0xb8/0xd8
>> [ddfc3f10] [c00b619c] sys_getdents64+0x80/0x104
>> [ddfc3f40] [c0010554] ret_from_syscall+0x0/0x3c
>> Instruction dump:
>> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0
>> 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b52a0
>> ctory
>> ---[ end trace 6b059a7f5ba5a55a ]---
>>
>> ---
>>
>> Oops: Exception in kernel mode, sig: 11 [#1]
>> Exsw1600
>> Modules linked in: gplonly_api
>> NIP: c00bd6fc LR: c00bd6fc CTR: 00000000
>> REGS: df52dc40 TRAP: 0700 Not tainted (2.6.27-rc2)
>> MSR: 00029000 <EE,ME> CR: 20082422 XER: 20000000
>> TASK = ddc2e060[2294] 'eeprom' THREAD: df52c000
>> GPR00: 00000000 df52dcf0 ddc2e060 dd82c700 df52dd58 df52dd48 dd82c75e
>> 00000000
>> GPR08: c0800000 000311f0 0000ffff df52dd10 20002488 1001e75c 100b0000
>> 1008e90c
>> GPR16: bfd15a50 df52de6c df52dd58 fffffff4 c0380000 df52dd50 df52dd48
>> dd9303fc
>> GPR24: dc3a6800 dd930390 df52dd58 c0380000 dd82c700 df52dd58 00000006
>> df52dcf0
>> NIP [c00bd6fc] d_lookup+0x40/0x90
>> LR [c00bd6fc] d_lookup+0x40/0x90
>> Call Trace:
>> [df52dcf0] [df52dd48] 0xdf52dd48 (unreliable)
>> [df52dd10] [c00b07a0] do_lookup+0xe8/0x220
>> [df52dd40] [c00b222c] __link_path_walk+0x174/0xd54
>> [df52ddb0] [c00b2e64] path_walk+0x58/0xe0
>> [df52dde0] [c00b2fd4] do_path_lookup+0x78/0x13c
>> [df52de10] [c00b3f3c] __path_lookup_intent_open+0x68/0xdc
>> [df52de40] [c00b3fd8] path_lookup_open+0x28/0x40
>> [df52de50] [c00b4234] do_filp_open+0xa0/0x798
>> [df52df00] [c00a3b80] do_sys_open+0x70/0x114
>> [df52df30] [c00a3c98] sys_open+0x38/0x50
>> [df52df40] [c0010554] ret_from_syscall+0x0/0x3c
>> Instruction dump:
>> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0
>> 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b52a0
>> om_version(Canno---[ end trace 0cef3c8cb49d09de ]---
>>
>> ---
>>
>> Oops: Exception in kernel mode, sig: 11 [#1]
>> Exsw1600
>> Modules linked in: bridge stp llc tun beacon_freescale(P) 80211(P)
>> shm_freescale(P) ath_remote_regs_freescale(P) exsw
>> load_local_fpga_freescale 8021q poe_fi
>> NIP: c00bd6fc LR: c00bd6fc CTR: 00000000
>> REGS: dd555c50 TRAP: 0700 Tainted: P (2.6.27-rc2)
>> MSR: 00029000 <EE,ME> CR: 24082282 XER: 20000000
>> TASK = df4669a0[4976] 'find' THREAD: dd554000
>> GPR00: 00000000 dd555d00 df4669a0 dd82ae00 dd555d68 dd555d58 dd82ae5b
>> 100234ec
>> GPR08: c0800000 0002dd0c 0000ffff dd555d20 24000288 100ad874 100936f8
>> 1008a1d0
>> GPR16: 10083f80 dd555e2c dd555d68 fffffff4 c0380000 dd555d60 dd555d58
>> dd9565d4
>> GPR24: dc3db480 dd956568 dd555d68 c0380000 dd82ae00 dd555d68 00000000
>> dd555d00
>> NIP [c00bd6fc] d_lookup+0x40/0x90
>> LR [c00bd6fc] d_lookup+0x40/0x90
>> Call Trace:
>> [dd555d00] [dd555d58] 0xdd555d58 (unreliable)
>> [dd555d20] [c00b07a0] do_lookup+0xe8/0x220
>> [dd555d50] [c00b265c] __link_path_walk+0x5a4/0xd54
>> [dd555dc0] [c00b2e64] path_walk+0x58/0xe0
>> [dd555df0] [c00b2fd4] do_path_lookup+0x78/0x13c
>> [dd555e20] [c00b3cd0] user_path_at+0x64/0xac
>> [dd555e90] [c00aac04] vfs_lstat_fd+0x34/0x74
>> [dd555ec0] [c00aacd8] vfs_lstat+0x30/0x48
>> [dd555ed0] [c00aad20] sys_lstat64+0x30/0x5c
>> [dd555f40] [c0010554] ret_from_syscall+0x0/0x3c
>> Instruction dump:
>> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0
>> 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b52a0
>> 76/auxv
>> /proc/4---[ end trace 0deef0827ce9df4b ]---
>>
>> ---
>> Oops: Exception in kernel mode, sig: 11 [#1]
>> Exsw1600
>> Modules linked in: bridge stp llc tun beacon_freescale(P) 80211(P)
>> shm_freescale(P) ath_remote_regs_freescale(P) exsw
>> load_local_fpga_freescale 8021q poe_fi
>> NIP: c00bd6fc LR: c00bd6fc CTR: 00000000
>> REGS: d57e7c40 TRAP: 0700 Tainted: P (2.6.27-rc2)
>> MSR: 00029000 <EE,ME> CR: 24482482 XER: 20000000
>> TASK = db3cc940[2054] 'eapd' THREAD: d57e6000
>> GPR00: 00000000 d57e7cf0 db3cc940 dd82a500 d57e7d58 d57e7d48 dd82a55a
>> 00000000
>> GPR08: c0800000 00016c64 0000ffff d57e7d10 24422488 1008f700 4802cd38
>> 100763e8
>> GPR16: 4801d818 d57e7e6c d57e7d58 fffffff4 c0380000 d57e7d50 d57e7d48
>> dd956c74
>> GPR24: dc3db680 dd956c08 d57e7d58 c0380000 dd82a500 d57e7d58 00000000
>> d57e7cf0
>> NIP [c00bd6fc] d_lookup+0x40/0x90
>> LR [c00bd6fc] d_lookup+0x40/0x90
>> Call Trace:
>> [d57e7cf0] [d57e7d48] 0xd57e7d48 (unreliable)
>> [d57e7d10] [c00b07a0] do_lookup+0xe8/0x220
>> [d57e7d40] [c00b265c] __link_path_walk+0x5a4/0xd54
>> [d57e7db0] [c00b2e64] path_walk+0x58/0xe0
>> [d57e7de0] [c00b2fd4] do_path_lookup+0x78/0x13c
>> [d57e7e10] [c00b3f3c] __path_lookup_intent_open+0x68/0xdc
>> [d57e7e40] [c00b3fd8] path_lookup_open+0x28/0x40
>> [d57e7e50] [c00b4234] do_filp_open+0xa0/0x798
>> [d57e7f00] [c00a3b80] do_sys_open+0x70/0x114
>> [d57e7f30] [c00a3c98] sys_open+0x38/0x50
>> [d57e7f40] [c0010554] ret_from_syscall+0x0/0x3c
>> Instruction dump:
>> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0
>> 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b52a0
>> ---[ end trace ffb4a6e37d78370a ]---
>>
>> ---
>> Oops: Exception in kernel mode, sig: 11 [#1]
>> Exsw1600
>> Modules linked in: bridge stp llc tun beacon_freescale(P) 80211(P)
>> shm_freescale(P) ath_remote_regs_freescale(P) exsw
>> load_local_fpga_freescale 8021q poe_fi
>> NIP: c00bd6fc LR: c00bd6fc CTR: c00bd6c8
>> REGS: decd5cf0 TRAP: 0700 Tainted: P (2.6.27-rc2)
>> MSR: 00029000 <EE,ME> CR: 22228882 XER: 20000000
>> TASK = d9f5a9a0[6897] 'sh' THREAD: decd4000
>> GPR00: 00000000 decd5da0 d9f5a9a0 dd801180 decd5de8 00000000 00000004
>> fffffff9
>> GPR08: fffffffa 00000000 00d5ca28 decd5d90 2ccc1686 100ad874 1fffb200
>> 00000000
>> GPR16: 007fff00 decd5ec4 00000008 c02fe298 00200200 00000000 c031aa34
>> decd5de8
>> GPR24: decd5df4 00001af2 dd507d00 c0380000 dd801180 decd5de8 00000000
>> decd5da0
>> NIP [c00bd6fc] d_lookup+0x40/0x90
>> LR [c00bd6fc] d_lookup+0x40/0x90
>> Call Trace:
>> [decd5dc0] [c00bd7d8] d_hash_and_lookup+0x8c/0xc4
>> [decd5de0] [c00ed728] proc_flush_task+0xb4/0x268
>> [decd5e40] [c0032894] release_task+0x6c/0x39c
>> [decd5e70] [c0033238] wait_consider_task+0x674/0x920
>> [decd5eb0] [c0033610] do_wait+0x12c/0x3d4
>> [decd5f20] [c0033954] sys_wait4+0x9c/0xe8
>> [decd5f40] [c0010554] ret_from_syscall+0x0/0x3c
>> Instruction dump:
>> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0
>> 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b52a0
>> ---[ end trace 00106ff4ec3b9c22 ]---
>>
>>
>>
>>
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Mathieu Desnoyers wrote:
>
> Steve ? I just noticed this :
>
>
> ftrace_modify_code(unsigned long ip, unsigned char *old_code,
> unsigned char *new_code)
> {
> unsigned replaced;
> unsigned old = *(unsigned *)old_code;
> unsigned new = *(unsigned *)new_code;
> int faulted = 0;
>
> /*
> * Note: Due to modules and __init, code can
> * disappear and change, we need to protect against faulting
> * as well as code changing.
> *
> * No real locking needed, this code is run through
> * kstop_machine.
> */
> asm volatile (
> "1: lwz %1, 0(%2)\n"
> " cmpw %1, %5\n"
> " bne 2f\n"
> " stwu %3, 0(%2)\n"
> "2:\n"
> ".section .fixup, \"ax\"\n"
> "3: li %0, 1\n"
> " b 2b\n"
> ".previous\n"
> ".section __ex_table,\"a\"\n"
> _ASM_ALIGN "\n"
> _ASM_PTR "1b, 3b\n"
> ".previous"
> : "=r"(faulted), "=r"(replaced)
> : "r"(ip), "r"(new),
> "0"(faulted), "r"(old)
> : "memory");
>
> if (replaced != old && replaced != new)
> faulted = 2;
>
> if (!faulted)
> flush_icache_range(ip, ip + 8);
>
> return faulted;
> }
>
> What happens if you are really unlucky and get the following execution
> sequence ?
>
>
> Load module A at address 0xddfc3e00
> Populate ftrace function table while the system runs
> Unload module A
> Load module B at address 0xddfc3e00
> Activate ftrace
> -> At that point, since there is another module loaded in the same
> address space, it won't fault. If there happens to be an instruction
> which looks exactly like the instruction we are replacing at the same
> location, we are introducing a bug. True both on x86 ans powerpc...
>
>
Hi Mathieu,
Yep I know of this issue, and it is very unlikely it would happen. But
that's not good enough, so this is why I did this:
http://marc.info/?l=linux-kernel&m=121876928124384&w=2
http://marc.info/?l=linux-kernel&m=121876938524523&w=2
;-)
On powerpc it would be less likely an issue since the code segments are
all 4 bytes aligned. And a call being replaced would be a call to mcount
(relative jump). A call to mcount would be the same for both. Then
again, we could be replacing a nop, but most likely that wouldn't hurt
either, since nops are seldom used, and if we did call mcount, it would
probably not hurt. But it would be an issue if Module B happen to put a
data section that matched the code to jmp to mcount or a nop.
-- Steve
* Steven Rostedt ([email protected]) wrote:
> Mathieu Desnoyers wrote:
>>
>> Steve ? I just noticed this :
>>
>>
>> ftrace_modify_code(unsigned long ip, unsigned char *old_code,
>> unsigned char *new_code)
>> {
>> unsigned replaced;
>> unsigned old = *(unsigned *)old_code;
>> unsigned new = *(unsigned *)new_code;
>> int faulted = 0;
>>
>> /*
>> * Note: Due to modules and __init, code can
>> * disappear and change, we need to protect against faulting
>> * as well as code changing.
>> *
>> * No real locking needed, this code is run through
>> * kstop_machine.
>> */
>> asm volatile (
>> "1: lwz %1, 0(%2)\n"
>> " cmpw %1, %5\n"
>> " bne 2f\n"
>> " stwu %3, 0(%2)\n"
>> "2:\n"
>> ".section .fixup, \"ax\"\n"
>> "3: li %0, 1\n"
>> " b 2b\n"
>> ".previous\n"
>> ".section __ex_table,\"a\"\n"
>> _ASM_ALIGN "\n"
>> _ASM_PTR "1b, 3b\n"
>> ".previous"
>> : "=r"(faulted), "=r"(replaced)
>> : "r"(ip), "r"(new),
>> "0"(faulted), "r"(old)
>> : "memory");
>>
>> if (replaced != old && replaced != new)
>> faulted = 2;
>>
>> if (!faulted)
>> flush_icache_range(ip, ip + 8);
>>
>> return faulted;
>> }
>>
>> What happens if you are really unlucky and get the following execution
>> sequence ?
>>
>>
>> Load module A at address 0xddfc3e00
>> Populate ftrace function table while the system runs
>> Unload module A
>> Load module B at address 0xddfc3e00
>> Activate ftrace
>> -> At that point, since there is another module loaded in the same
>> address space, it won't fault. If there happens to be an instruction
>> which looks exactly like the instruction we are replacing at the same
>> location, we are introducing a bug. True both on x86 ans powerpc...
>>
>>
>
> Hi Mathieu,
>
> Yep I know of this issue, and it is very unlikely it would happen. But
> that's not good enough, so this is why I did this:
>
> http://marc.info/?l=linux-kernel&m=121876928124384&w=2
> http://marc.info/?l=linux-kernel&m=121876938524523&w=2
>
> ;-)
>
> On powerpc it would be less likely an issue since the code segments are all
> 4 bytes aligned. And a call being replaced would be a call to mcount
> (relative jump). A call to mcount would be the same for both. Then again,
> we could be replacing a nop, but most likely that wouldn't hurt either,
> since nops are seldom used, and if we did call mcount, it would probably
> not hurt. But it would be an issue if Module B happen to put a data section
> that matched the code to jmp to mcount or a nop.
>
Ah, I did not see this one passing by :) That's the right solution
indeed. I guess it's not in 2.6.27-rc2/rc3 though ? I wonder if the
bug can be repeated with a module-free kernel ?
Mathieu
> -- Steve
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Mathieu Desnoyers wrote:
> asm volatile (
> "1: lwz %1, 0(%2)\n"
> " cmpw %1, %5\n"
> " bne 2f\n"
> " stwu %3, 0(%2)\n"
> "2:\n"
> ".section .fixup, \"ax\"\n"
> "3: li %0, 1\n"
> " b 2b\n"
> ".previous\n"
> ".section __ex_table,\"a\"\n"
> _ASM_ALIGN "\n"
> _ASM_PTR "1b, 3b\n"
> ".previous"
> : "=r"(faulted), "=r"(replaced)
> : "r"(ip), "r"(new),
> "0"(faulted), "r"(old)
> : "memory");
Some (most likely unrelated) nits in the above inline asm:
Should use a "b" constraint for %2, or you could get r0. Or, use an "m"
constraint with %U2%X2 after the lwz/stw. Why stwu with an offset of
zero, BTW?
%1 also needs to be an early clobber.
-Scott
On Mon, 18 Aug 2008, Scott Wood wrote:
> Mathieu Desnoyers wrote:
> > asm volatile (
> > "1: lwz %1, 0(%2)\n"
> > " cmpw %1, %5\n"
> > " bne 2f\n"
> > " stwu %3, 0(%2)\n"
> > "2:\n"
> > ".section .fixup, \"ax\"\n"
> > "3: li %0, 1\n"
> > " b 2b\n"
> > ".previous\n"
> > ".section __ex_table,\"a\"\n"
> > _ASM_ALIGN "\n"
> > _ASM_PTR "1b, 3b\n"
> > ".previous"
> > : "=r"(faulted), "=r"(replaced)
> > : "r"(ip), "r"(new),
> > "0"(faulted), "r"(old)
> > : "memory");
>
> Some (most likely unrelated) nits in the above inline asm:
Thanks,
>
> Should use a "b" constraint for %2, or you could get r0.
I will make an updated patch.
> Or, use an "m"
> constraint with %U2%X2 after the lwz/stw.
The 'b' seems easier ;-)
> Why stwu with an offset of zero,
> BTW?
Because it worked :-) Really, it has been a while since I did any PPC
assembly, and I didn't break out the old reference manuals to do this.
I simply looked at other asm code, and followed suit. I compiled and
tested it, and it worked. I did make a disclaimer about my rusty PPC
knowledge when I posted the code.
>
> %1 also needs to be an early clobber.
Not exactly sure what you mean by the above.
-- Steve
Steven Rostedt wrote:
> Eran Liberty wrote:
>> After compiling a kernel with ftrace I started to experience all
>> sorts of crashes.
>
> Just to make sure...
>
> ftrace enables markers too, and RCU has tracing with the markers. This
> may not be the problem, but I just want to eliminate as many variables
> as possible.
> Could you disable ftrace, but keep the markers on too. Also, could
> you enable ftrace again and turn on the FTRACE_STARTUP_TEST.
for the fun of it I took out all my propriety modules; so now its a non
tainted kernel.
Here is the matrix:
!FTRACE x !MARKERS => stable
!FTRACE x MARKERS => stable
FTRACE x !MARKERS => n/a (FTRACE forces MARKERS)
FTRACE x MARKERS => unstable
FTRACE x FTRACE_STARTUP_TEST x MARKERS => unstable + tests passed
Testing tracer sched_switch: PASSED
Testing tracer ftrace: PASSED
Testing dynamic ftrace: PASSED
Oops: Exception in kernel mode, sig: 11 [#1]
Exsw1600
Modules linked in:
NIP: c00bbb20 LR: c00bbb20 CTR: 00000000
REGS: dd5b1c50 TRAP: 0700 Not tainted (2.6.27-rc2)
MSR: 00029000 <EE,ME> CR: 24082282 XER: 20000000
TASK = ddcce060[1707] 'find' THREAD: dd5b0000
GPR00: 00000000 dd5b1d00 ddcce060 dd801180 dd5b1d68 dd5b1d58 dd80125b
100234ec
GPR08: c0800000 00019330 0000ffff dd5b1d20 24000288 100ad874 100936f8
1008a1d0
GPR16: 10083f80 dd5b1e2c dd5b1d68 fffffff4 c0380000 dd5b1d60 dd5b1d58
dd802084
GPR24: dc3d7700 dd802018 dd5b1d68 c0380000 dd801180 dd5b1d68 00000000
dd5b1d00
NIP [c00bbb20] d_lookup+0x40/0x90
LR [c00bbb20] d_lookup+0x40/0x90
Call Trace:
[dd5b1d00] [dd5b1d58] 0xdd5b1d58 (unreliable)
[dd5b1d20] [c00aebc4] do_lookup+0xe8/0x220
[dd5b1d50] [c00b0a80] __link_path_walk+0x5a4/0xd54
[dd5b1dc0] [c00b1288] path_walk+0x58/0xe0
[dd5b1df0] [c00b13f8] do_path_lookup+0x78/0x13c
[dd5b1e20] [c00b20f4] user_path_at+0x64/0xac
[dd5b1e90] [c00a9028] vfs_lstat_fd+0x34/0x74
[dd5b1ec0] [c00a90fc] vfs_lstat+0x30/0x48
[dd5b1ed0] [c00a9144] sys_lstat64+0x30/0x5c
[dd5b1f40] [c0010554] ret_from_syscall+0x0/0x3c
Instruction dump:
7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0
73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b32a0
---[ end trace 1eb8fd7adac2bb65 ]---
Liberty
Steven Rostedt wrote:
>> Should use a "b" constraint for %2, or you could get r0.
>
> I will make an updated patch.
>
>> Or, use an "m"
>> constraint with %U2%X2 after the lwz/stw.
>
> The 'b' seems easier ;-)
The advantage of the latter is that it allows GCC to choose indexed or
update instructions -- but that's merely an optimization. Switching to
"b" is enough to avoid the potential bug.
>> %1 also needs to be an early clobber.
>
> Not exactly sure what you mean by the above.
%1 is written to before some inputs are consumed, so you need to use
"=&r" rather than "=r" so that GCC won't use the same register for both.
-Scott
* Eran Liberty ([email protected]) wrote:
> Steven Rostedt wrote:
>> Eran Liberty wrote:
>>> After compiling a kernel with ftrace I started to experience all sorts of
>>> crashes.
>>
>> Just to make sure...
>>
>> ftrace enables markers too, and RCU has tracing with the markers. This may
>> not be the problem, but I just want to eliminate as many variables as
>> possible.
>> Could you disable ftrace, but keep the markers on too. Also, could you
>> enable ftrace again and turn on the FTRACE_STARTUP_TEST.
>
> for the fun of it I took out all my propriety modules; so now its a non
> tainted kernel.
>
> Here is the matrix:
>
> !FTRACE x !MARKERS => stable
> !FTRACE x MARKERS => stable
> FTRACE x !MARKERS => n/a (FTRACE forces MARKERS)
> FTRACE x MARKERS => unstable
> FTRACE x FTRACE_STARTUP_TEST x MARKERS => unstable + tests passed
>
> Testing tracer sched_switch: PASSED
> Testing tracer ftrace: PASSED
> Testing dynamic ftrace: PASSED
>
> Oops: Exception in kernel mode, sig: 11 [#1]
> Exsw1600
> Modules linked in:
> NIP: c00bbb20 LR: c00bbb20 CTR: 00000000
> REGS: dd5b1c50 TRAP: 0700 Not tainted (2.6.27-rc2)
> MSR: 00029000 <EE,ME> CR: 24082282 XER: 20000000
> TASK = ddcce060[1707] 'find' THREAD: dd5b0000
> GPR00: 00000000 dd5b1d00 ddcce060 dd801180 dd5b1d68 dd5b1d58 dd80125b
> 100234ec
> GPR08: c0800000 00019330 0000ffff dd5b1d20 24000288 100ad874 100936f8
> 1008a1d0
> GPR16: 10083f80 dd5b1e2c dd5b1d68 fffffff4 c0380000 dd5b1d60 dd5b1d58
> dd802084
> GPR24: dc3d7700 dd802018 dd5b1d68 c0380000 dd801180 dd5b1d68 00000000
> dd5b1d00
> NIP [c00bbb20] d_lookup+0x40/0x90
> LR [c00bbb20] d_lookup+0x40/0x90
> Call Trace:
> [dd5b1d00] [dd5b1d58] 0xdd5b1d58 (unreliable)
Can you check if, at some point during the system execution (starting
from boot), 0xdd5b1d58 is an address where a module is loaded ? (the
module can be later unloaded, what I wonder is if this address would
appear to have had a loaded+unloaded module).
Actually, could you try to compile your kernel without "MODULE_UNLOAD" ?
Mathieu
> [dd5b1d20] [c00aebc4] do_lookup+0xe8/0x220
> [dd5b1d50] [c00b0a80] __link_path_walk+0x5a4/0xd54
> [dd5b1dc0] [c00b1288] path_walk+0x58/0xe0
> [dd5b1df0] [c00b13f8] do_path_lookup+0x78/0x13c
> [dd5b1e20] [c00b20f4] user_path_at+0x64/0xac
> [dd5b1e90] [c00a9028] vfs_lstat_fd+0x34/0x74
> [dd5b1ec0] [c00a90fc] vfs_lstat+0x30/0x48
> [dd5b1ed0] [c00a9144] sys_lstat64+0x30/0x5c
> [dd5b1f40] [c0010554] ret_from_syscall+0x0/0x3c
> Instruction dump:
> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0
> 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b32a0
> ---[ end trace 1eb8fd7adac2bb65 ]---
>
> Liberty
>
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Mon, 18 Aug 2008, Scott Wood wrote:
> Mathieu Desnoyers wrote:
> > asm volatile (
> > "1: lwz %1, 0(%2)\n"
> > " cmpw %1, %5\n"
> > " bne 2f\n"
> > " stwu %3, 0(%2)\n"
> > "2:\n"
> > ".section .fixup, \"ax\"\n"
> > "3: li %0, 1\n"
> > " b 2b\n"
> > ".previous\n"
> > ".section __ex_table,\"a\"\n"
> > _ASM_ALIGN "\n"
> > _ASM_PTR "1b, 3b\n"
> > ".previous"
> > : "=r"(faulted), "=r"(replaced)
> > : "r"(ip), "r"(new),
> > "0"(faulted), "r"(old)
> > : "memory");
>
> Some (most likely unrelated) nits in the above inline asm:
>
> Should use a "b" constraint for %2, or you could get r0. Or, use an "m"
> constraint with %U2%X2 after the lwz/stw.
What syntax to do that with?
lwz %1,0(%U2)
stu %3, 0(%X2)
I'm new to those. (and the above does not compile)
> Why stwu with an offset of zero,
How else to do it? stwu %3, (%2) does not compile for me.
-- Steve
On Mon, 18 Aug 2008, Eran Liberty wrote:
> Steven Rostedt wrote:
> > Eran Liberty wrote:
> > > After compiling a kernel with ftrace I started to experience all sorts of
> > > crashes.
> >
> > Just to make sure...
> >
> > ftrace enables markers too, and RCU has tracing with the markers. This may
> > not be the problem, but I just want to eliminate as many variables as
> > possible.
> > Could you disable ftrace, but keep the markers on too. Also, could you
> > enable ftrace again and turn on the FTRACE_STARTUP_TEST.
>
> for the fun of it I took out all my propriety modules; so now its a non
> tainted kernel.
>
> Here is the matrix:
>
> !FTRACE x !MARKERS => stable
> !FTRACE x MARKERS => stable
> FTRACE x !MARKERS => n/a (FTRACE forces MARKERS)
> FTRACE x MARKERS => unstable
> FTRACE x FTRACE_STARTUP_TEST x MARKERS => unstable + tests passed
Thanks
>
> Testing tracer sched_switch: PASSED
> Testing tracer ftrace: PASSED
> Testing dynamic ftrace: PASSED
>
> Oops: Exception in kernel mode, sig: 11 [#1]
> Exsw1600
> Modules linked in:
> NIP: c00bbb20 LR: c00bbb20 CTR: 00000000
Could you load this into gdb for me and show me the output of:
gdb> li *0xc00bbb20
(Assuming you compiled with debuginfo on)
and...
gdb> disass 0xc00bbb20
> REGS: dd5b1c50 TRAP: 0700 Not tainted (2.6.27-rc2)
> MSR: 00029000 <EE,ME> CR: 24082282 XER: 20000000
> TASK = ddcce060[1707] 'find' THREAD: dd5b0000
> GPR00: 00000000 dd5b1d00 ddcce060 dd801180 dd5b1d68 dd5b1d58 dd80125b 100234ec
> GPR08: c0800000 00019330 0000ffff dd5b1d20 24000288 100ad874 100936f8 1008a1d0
> GPR16: 10083f80 dd5b1e2c dd5b1d68 fffffff4 c0380000 dd5b1d60 dd5b1d58 dd802084
> GPR24: dc3d7700 dd802018 dd5b1d68 c0380000 dd801180 dd5b1d68 00000000 dd5b1d00
> NIP [c00bbb20] d_lookup+0x40/0x90
> LR [c00bbb20] d_lookup+0x40/0x90
> Call Trace:
> [dd5b1d00] [dd5b1d58] 0xdd5b1d58 (unreliable)
> [dd5b1d20] [c00aebc4] do_lookup+0xe8/0x220
> [dd5b1d50] [c00b0a80] __link_path_walk+0x5a4/0xd54
> [dd5b1dc0] [c00b1288] path_walk+0x58/0xe0
> [dd5b1df0] [c00b13f8] do_path_lookup+0x78/0x13c
> [dd5b1e20] [c00b20f4] user_path_at+0x64/0xac
> [dd5b1e90] [c00a9028] vfs_lstat_fd+0x34/0x74
> [dd5b1ec0] [c00a90fc] vfs_lstat+0x30/0x48
> [dd5b1ed0] [c00a9144] sys_lstat64+0x30/0x5c
> [dd5b1f40] [c0010554] ret_from_syscall+0x0/0x3c
> Instruction dump:
> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0
> 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b32a0
Ouch! we have a 00000000 instruction. I'm almost done with the new mcount
record for PPC (I have it done for ppc64, I'm just porting it to 32). I'm
thinking this new code may solve your issues too. I hate the daemon.
-- Steve
Steven Rostedt wrote:
> What syntax to do that with?
>
> lwz %1,0(%U2)
> stu %3, 0(%X2)
>
> I'm new to those. (and the above does not compile)
lwz%U2%X2 %1, %2
stw%U2%X2 %3, %2
>> Why stwu with an offset of zero,
>
> How else to do it? stwu %3, (%2) does not compile for me.
stw %3, 0(%2)
The "u" tells it to write the effective address back to %2 -- but with
an offset of zero, the effective address is unchanged.
-Scott
On Mon, 18 Aug 2008, Scott Wood wrote:
> Steven Rostedt wrote:
> > What syntax to do that with?
> >
> > lwz %1,0(%U2)
> > stu %3, 0(%X2)
> >
> > I'm new to those. (and the above does not compile)
>
> lwz%U2%X2 %1, %2
> stw%U2%X2 %3, %2
Thanks, that's new to me.
>
> > > Why stwu with an offset of zero,
> >
> > How else to do it? stwu %3, (%2) does not compile for me.
>
> stw %3, 0(%2)
>
> The "u" tells it to write the effective address back to %2 -- but with an
> offset of zero, the effective address is unchanged.
Ah! Right!
/me should open up his PowerPC ref books again :-p
-- Steve
On Mon, 2008-08-18 at 11:07 -0400, Steven Rostedt wrote:
> Eran Liberty wrote:
> > After compiling a kernel with ftrace I started to experience all sorts
> > of crashes.
>
> Just to make sure...
>
> ftrace enables markers too, and RCU has tracing with the markers. This
> may not be the problem, but I just want to eliminate as many variables
> as possible.
> Could you disable ftrace, but keep the markers on too. Also, could you
> enable ftrace again and turn on the FTRACE_STARTUP_TEST.
I spent some time last week tracking one of those crashes and it appears
that we are getting corruption of some of the non-volatile registers.
So far, I found out that it -seems- to be coming from stack frame
corruption during a timer interrupt.
I haven't had a chance to dig further yet.
Cheers,
Ben.
> > Oops: Exception in kernel mode, sig: 11 [#1]
> > Exsw1600
> > Modules linked in:
> > NIP: c00bbb20 LR: c00bbb20 CTR: 00000000
> > REGS: dd5b1c50 TRAP: 0700 Not tainted (2.6.27-rc2)
> > MSR: 00029000 <EE,ME> CR: 24082282 XER: 20000000
> > TASK = ddcce060[1707] 'find' THREAD: dd5b0000
> > GPR00: 00000000 dd5b1d00 ddcce060 dd801180 dd5b1d68 dd5b1d58 dd80125b
> > 100234ec
> > GPR08: c0800000 00019330 0000ffff dd5b1d20 24000288 100ad874 100936f8
> > 1008a1d0
> > GPR16: 10083f80 dd5b1e2c dd5b1d68 fffffff4 c0380000 dd5b1d60 dd5b1d58
> > dd802084
> > GPR24: dc3d7700 dd802018 dd5b1d68 c0380000 dd801180 dd5b1d68 00000000
> > dd5b1d00
> > NIP [c00bbb20] d_lookup+0x40/0x90
> > LR [c00bbb20] d_lookup+0x40/0x90
> > Call Trace:
> > [dd5b1d00] [dd5b1d58] 0xdd5b1d58 (unreliable)
>
> Can you check if, at some point during the system execution (starting
> from boot), 0xdd5b1d58 is an address where a module is loaded ? (the
> module can be later unloaded, what I wonder is if this address would
> appear to have had a loaded+unloaded module).
>
> Actually, could you try to compile your kernel without "MODULE_UNLOAD" ?
I've had similar crashes (with similar backtraces in the VFS) on
machines with no modules, netbooted zImages.
Ben.
On Mon, 2008-08-18 at 14:27 -0400, Steven Rostedt wrote:
>
> On Mon, 18 Aug 2008, Scott Wood wrote:
>
> > Mathieu Desnoyers wrote:
> > > asm volatile (
> > > "1: lwz %1, 0(%2)\n"
> > > " cmpw %1, %5\n"
> > > " bne 2f\n"
> > > " stwu %3, 0(%2)\n"
> > > "2:\n"
> > > ".section .fixup, \"ax\"\n"
> > > "3: li %0, 1\n"
> > > " b 2b\n"
> > > ".previous\n"
> > > ".section __ex_table,\"a\"\n"
> > > _ASM_ALIGN "\n"
> > > _ASM_PTR "1b, 3b\n"
> > > ".previous"
> > > : "=r"(faulted), "=r"(replaced)
> > > : "r"(ip), "r"(new),
> > > "0"(faulted), "r"(old)
> > > : "memory");
> >
> > Some (most likely unrelated) nits in the above inline asm:
Why not use __get_user/__put_user ?
Cheers,
Ben.
On Tue, 19 Aug 2008, Benjamin Herrenschmidt wrote:
> On Mon, 2008-08-18 at 14:27 -0400, Steven Rostedt wrote:
> >
> > On Mon, 18 Aug 2008, Scott Wood wrote:
> >
> > > Mathieu Desnoyers wrote:
> > > > asm volatile (
> > > > "1: lwz %1, 0(%2)\n"
> > > > " cmpw %1, %5\n"
> > > > " bne 2f\n"
> > > > " stwu %3, 0(%2)\n"
> > > > "2:\n"
> > > > ".section .fixup, \"ax\"\n"
> > > > "3: li %0, 1\n"
> > > > " b 2b\n"
> > > > ".previous\n"
> > > > ".section __ex_table,\"a\"\n"
> > > > _ASM_ALIGN "\n"
> > > > _ASM_PTR "1b, 3b\n"
> > > > ".previous"
> > > > : "=r"(faulted), "=r"(replaced)
> > > > : "r"(ip), "r"(new),
> > > > "0"(faulted), "r"(old)
> > > > : "memory");
> > >
> > > Some (most likely unrelated) nits in the above inline asm:
>
>
> Why not use __get_user/__put_user ?
Hmm, this was originally copied from x86, where we did a cmpxchg, but that
is probably not needed since all of this is done in kstop_machine. Also,
only the "get" is needed. If we don't fault there, we wont fault on the
put (unless we have permissions wrong, and that would be a bug).
So are you recommending something like
int cmd;
if (__get_user(cmd, ip))
goto fault;
if (cmd != old)
goto not_same;
WARN_ON_ONCE(__put_user(cmd, ip));
If we did this, we could probably put this into the generic code:
if (copy_from_user(cmd, ip, ARCH_CALL_SIZE))
goto fault;
if (memcmp(cmd, old, ARCH_CALL_SIZE) != 0)
goto not_same;
WARN_ON_ONCE(copy_to_user(cmd, ip, ARCH_CALL_SIZE));
-- Steve
> Hmm, this was originally copied from x86, where we did a cmpxchg, but that
> is probably not needed since all of this is done in kstop_machine. Also,
> only the "get" is needed. If we don't fault there, we wont fault on the
> put (unless we have permissions wrong, and that would be a bug).
Would it ? How do we make sure the kernel text is mapped writeable ?
> So are you recommending something like
>
> int cmd;
>
> if (__get_user(cmd, ip))
> goto fault;
>
> if (cmd != old)
> goto not_same;
>
> WARN_ON_ONCE(__put_user(cmd, ip));
>
> If we did this, we could probably put this into the generic code:
That would work I suppose, I'll give it a try.
> if (copy_from_user(cmd, ip, ARCH_CALL_SIZE))
> goto fault;
>
> if (memcmp(cmd, old, ARCH_CALL_SIZE) != 0)
> goto not_same;
>
> WARN_ON_ONCE(copy_to_user(cmd, ip, ARCH_CALL_SIZE));
You need the __ variants or the access_ok() checks will bite
you bad.
Ben.
On Tue, 19 Aug 2008, Benjamin Herrenschmidt wrote:
>
> > Hmm, this was originally copied from x86, where we did a cmpxchg, but that
> > is probably not needed since all of this is done in kstop_machine. Also,
> > only the "get" is needed. If we don't fault there, we wont fault on the
> > put (unless we have permissions wrong, and that would be a bug).
>
> Would it ? How do we make sure the kernel text is mapped writeable ?
We map it writeable if FTRACE is enabled.
>
> > So are you recommending something like
> >
> > int cmd;
> >
> > if (__get_user(cmd, ip))
> > goto fault;
> >
> > if (cmd != old)
> > goto not_same;
> >
> > WARN_ON_ONCE(__put_user(cmd, ip));
> >
> > If we did this, we could probably put this into the generic code:
>
> That would work I suppose, I'll give it a try.
>
> > if (copy_from_user(cmd, ip, ARCH_CALL_SIZE))
> > goto fault;
> >
> > if (memcmp(cmd, old, ARCH_CALL_SIZE) != 0)
> > goto not_same;
> >
> > WARN_ON_ONCE(copy_to_user(cmd, ip, ARCH_CALL_SIZE));
>
> You need the __ variants or the access_ok() checks will bite
> you bad.
Ah, good point ;-)
-- Steve
* Steven Rostedt ([email protected]) wrote:
>
> On Tue, 19 Aug 2008, Benjamin Herrenschmidt wrote:
>
> >
> > > Hmm, this was originally copied from x86, where we did a cmpxchg, but that
> > > is probably not needed since all of this is done in kstop_machine. Also,
> > > only the "get" is needed. If we don't fault there, we wont fault on the
> > > put (unless we have permissions wrong, and that would be a bug).
> >
> > Would it ? How do we make sure the kernel text is mapped writeable ?
>
> We map it writeable if FTRACE is enabled.
>
Argh. See text_poke(). It's there exactly for this purpose on x86.
Mathieu
> >
> > > So are you recommending something like
> > >
> > > int cmd;
> > >
> > > if (__get_user(cmd, ip))
> > > goto fault;
> > >
> > > if (cmd != old)
> > > goto not_same;
> > >
> > > WARN_ON_ONCE(__put_user(cmd, ip));
> > >
> > > If we did this, we could probably put this into the generic code:
> >
> > That would work I suppose, I'll give it a try.
> >
> > > if (copy_from_user(cmd, ip, ARCH_CALL_SIZE))
> > > goto fault;
> > >
> > > if (memcmp(cmd, old, ARCH_CALL_SIZE) != 0)
> > > goto not_same;
> > >
> > > WARN_ON_ONCE(copy_to_user(cmd, ip, ARCH_CALL_SIZE));
> >
> > You need the __ variants or the access_ok() checks will bite
> > you bad.
>
> Ah, good point ;-)
>
> -- Steve
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Ok, so i did a patch, but it doesn't fix the problem. So
there's something else whacking on the stack frames.
Still, here it is:
powerpc/ftrace: Fix broken assembly for code replacement
Instead, uses __get_user() and __put_user().
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
---
Index: linux-work/arch/powerpc/kernel/ftrace.c
===================================================================
--- linux-work.orig/arch/powerpc/kernel/ftrace.c 2008-08-19 12:45:49.000000000 +1000
+++ linux-work/arch/powerpc/kernel/ftrace.c 2008-08-19 12:52:51.000000000 +1000
@@ -16,7 +16,7 @@
#include <asm/cacheflush.h>
#include <asm/ftrace.h>
-
+#include <asm/uaccess.h>
static unsigned int ftrace_nop = 0x60000000;
@@ -72,10 +72,9 @@ notrace int
ftrace_modify_code(unsigned long ip, unsigned char *old_code,
unsigned char *new_code)
{
- unsigned replaced;
- unsigned old = *(unsigned *)old_code;
- unsigned new = *(unsigned *)new_code;
- int faulted = 0;
+ unsigned int old = *(unsigned int *)old_code;
+ unsigned int new = *(unsigned int *)new_code;
+ unsigned int instr;
/*
* Note: Due to modules and __init, code can
@@ -85,32 +84,13 @@ ftrace_modify_code(unsigned long ip, uns
* No real locking needed, this code is run through
* kstop_machine.
*/
- asm volatile (
- "1: lwz %1, 0(%2)\n"
- " cmpw %1, %5\n"
- " bne 2f\n"
- " stwu %3, 0(%2)\n"
- "2:\n"
- ".section .fixup, \"ax\"\n"
- "3: li %0, 1\n"
- " b 2b\n"
- ".previous\n"
- ".section __ex_table,\"a\"\n"
- _ASM_ALIGN "\n"
- _ASM_PTR "1b, 3b\n"
- ".previous"
- : "=r"(faulted), "=r"(replaced)
- : "r"(ip), "r"(new),
- "0"(faulted), "r"(old)
- : "memory");
-
- if (replaced != old && replaced != new)
- faulted = 2;
-
- if (!faulted)
- flush_icache_range(ip, ip + 8);
-
- return faulted;
+ if (__get_user(instr, (unsigned int __user *)ip))
+ return 1;
+ if (instr != old && instr != new)
+ return 2;
+ WARN_ON_ONCE(__put_user(new, (unsigned int __user *)ip));
+ flush_icache_range(ip, ip + 8);
+ return 0;
}
notrace int ftrace_update_ftrace_func(ftrace_func_t func)
On Tue, 19 Aug 2008, Benjamin Herrenschmidt wrote:
> Ok, so i did a patch, but it doesn't fix the problem. So
> there's something else whacking on the stack frames.
Is it 32 bit or 64? I've tested the 64 bit quite a bit, but not so much
the 32 (my powerbook is not usually that stable).
You might want to look at the entry.S mcount code too.
>
> Still, here it is:
Thanks,
>
> powerpc/ftrace: Fix broken assembly for code replacement
>
> Instead, uses __get_user() and __put_user().
>
Acked-by: Steven Rostedt <[email protected]>
-- Steve
On Mon, 18 Aug 2008, Mathieu Desnoyers wrote:
> * Steven Rostedt ([email protected]) wrote:
> >
> > On Tue, 19 Aug 2008, Benjamin Herrenschmidt wrote:
> >
> > >
> > > > Hmm, this was originally copied from x86, where we did a cmpxchg, but that
> > > > is probably not needed since all of this is done in kstop_machine. Also,
> > > > only the "get" is needed. If we don't fault there, we wont fault on the
> > > > put (unless we have permissions wrong, and that would be a bug).
> > >
> > > Would it ? How do we make sure the kernel text is mapped writeable ?
> >
> > We map it writeable if FTRACE is enabled.
> >
>
> Argh. See text_poke(). It's there exactly for this purpose on x86.
Ouch, I just did. text_poke is quite heavy. It would be interesting to see
that performed on 20,000 locations at one time. I could play with it, but
I'm a bit nervous.
-- Steve
* Steven Rostedt ([email protected]) wrote:
>
> On Mon, 18 Aug 2008, Mathieu Desnoyers wrote:
>
> > * Steven Rostedt ([email protected]) wrote:
> > >
> > > On Tue, 19 Aug 2008, Benjamin Herrenschmidt wrote:
> > >
> > > >
> > > > > Hmm, this was originally copied from x86, where we did a cmpxchg, but that
> > > > > is probably not needed since all of this is done in kstop_machine. Also,
> > > > > only the "get" is needed. If we don't fault there, we wont fault on the
> > > > > put (unless we have permissions wrong, and that would be a bug).
> > > >
> > > > Would it ? How do we make sure the kernel text is mapped writeable ?
> > >
> > > We map it writeable if FTRACE is enabled.
> > >
> >
> > Argh. See text_poke(). It's there exactly for this purpose on x86.
>
> Ouch, I just did. text_poke is quite heavy. It would be interesting to see
> that performed on 20,000 locations at one time. I could play with it, but
> I'm a bit nervous.
>
It's alread used to modify the LOCK prefixes in alternative.c and did
not seem to be too slow for that.. it should therefore be ok.
Mathieu
> -- Steve
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Mon, 18 Aug 2008, Mathieu Desnoyers wrote:
> > > >
> > >
> > > Argh. See text_poke(). It's there exactly for this purpose on x86.
> >
> > Ouch, I just did. text_poke is quite heavy. It would be interesting to see
> > that performed on 20,000 locations at one time. I could play with it, but
> > I'm a bit nervous.
> >
>
> It's alread used to modify the LOCK prefixes in alternative.c and did
> not seem to be too slow for that.. it should therefore be ok.
There's a lot more functions than locks ;-)
Anyway, this is only needed after system boot up. But then we do it ever
time we start or stop tracing. I'll try it out and see if it is noticable.
-- Steve
On Mon, 2008-08-18 at 23:12 -0400, Steven Rostedt wrote:
> On Tue, 19 Aug 2008, Benjamin Herrenschmidt wrote:
>
> > Ok, so i did a patch, but it doesn't fix the problem. So
> > there's something else whacking on the stack frames.
>
> Is it 32 bit or 64? I've tested the 64 bit quite a bit, but not so much
> the 32 (my powerbook is not usually that stable).
32-bit. Reproduced on powerbook by a user, 440, 750, ...
> You might want to look at the entry.S mcount code too.
Yeah, I had a look, nothing obviously bad there, but I may just
have missed it :-)
Ben.
Mathieu Desnoyers wrote:
> Can you check if, at some point during the system execution (starting
> from boot), 0xdd5b1d58 is an address where a module is loaded ? (the
> module can be later unloaded, what I wonder is if this address would
> appear to have had a loaded+unloaded module).
>
> Actually, could you try to compile your kernel without "MODULE_UNLOAD" ?
>
> Mathieu
>
No modules...
~ # uname -a
Linux 2.6.27-rc2 #6 Tue Aug 19 12:33:34 IDT 2008 ppc unknown
~ # gunzip /proc/config.gz -c | grep UNLOAD
# CONFIG_MODULE_UNLOAD is not set
~ # gunzip /proc/config.gz -c | grep FTRACE
CONFIG_HAVE_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_FTRACE=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE_SELFTEST=y
CONFIG_FTRACE_STARTUP_TEST=y
~ # gunzip /proc/config.gz -c | grep MARKERS
CONFIG_MARKERS=y
~ # lsmod
Module Size Used by
~ # while [ 1 ] ; do find / > /dev/null ; echo . ; done
.
Oops: Exception in kernel mode, sig: 11 [#1]
Exsw1600
Modules linked in:
NIP: c00bb724 LR: c00bb724 CTR: 00000000
REGS: d8b59c50 TRAP: 0700 Not tainted (2.6.27-rc2)
MSR: 00029000 <EE,ME> CR: 24082282 XER: 20000000
TASK = dbc68de0[1712] 'find' THREAD: d8b58000
GPR00: 00000000 d8b59d00 dbc68de0 dd801180 d8b59d68 d8b59d58 dd8019db
100234ec
GPR08: c0800000 00019330 0000ffff d8b59d20 24000288 100ad874 100936f8
1008a1d0
GPR16: 10083f80 d8b59e2c d8b59d68 fffffff4 c0380000 d8b59d60 d8b59d58
dd802084
GPR24: ddc69500 dd802018 d8b59d68 c0380000 dd801180 d8b59d68 00000000
d8b59d00
NIP [c00bb724] d_lookup+0x40/0x90
LR [c00bb724] d_lookup+0x40/0x90
Call Trace:
[d8b59d00] [d8b59d58] 0xd8b59d58 (unreliable)
[d8b59d20] [c00ae7c8] do_lookup+0xe8/0x220
[d8b59d50] [c00b0684] __link_path_walk+0x5a4/0xd54
[d8b59dc0] [c00b0e8c] path_walk+0x58/0xe0
[d8b59df0] [c00b0ffc] do_path_lookup+0x78/0x13c
[d8b59e20] [c00b1cf8] user_path_at+0x64/0xac
[d8b59e90] [c00a8c98] vfs_lstat_fd+0x34/0x74
[d8b59ec0] [c00a8d6c] vfs_lstat+0x30/0x48
[d8b59ed0] [c00a8db4] sys_lstat64+0x30/0x5c
[d8b59f40] [c0010554] ret_from_syscall+0x0/0x3c
Instruction dump:
7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0
73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b32a0
---[ end trace 7766edd310cd3442 ]---
Steven Rostedt wrote:
> On Mon, 18 Aug 2008, Eran Liberty wrote:
>
>
>> Steven Rostedt wrote:
>>
>>> Eran Liberty wrote:
>>>
>>>> After compiling a kernel with ftrace I started to experience all sorts of
>>>> crashes.
>>>>
>>> Just to make sure...
>>>
>>> ftrace enables markers too, and RCU has tracing with the markers. This may
>>> not be the problem, but I just want to eliminate as many variables as
>>> possible.
>>> Could you disable ftrace, but keep the markers on too. Also, could you
>>> enable ftrace again and turn on the FTRACE_STARTUP_TEST.
>>>
>> for the fun of it I took out all my propriety modules; so now its a non
>> tainted kernel.
>>
>> Here is the matrix:
>>
>> !FTRACE x !MARKERS => stable
>> !FTRACE x MARKERS => stable
>> FTRACE x !MARKERS => n/a (FTRACE forces MARKERS)
>> FTRACE x MARKERS => unstable
>> FTRACE x FTRACE_STARTUP_TEST x MARKERS => unstable + tests passed
>>
>
> Thanks
>
>
>> Testing tracer sched_switch: PASSED
>> Testing tracer ftrace: PASSED
>> Testing dynamic ftrace: PASSED
>>
>> Oops: Exception in kernel mode, sig: 11 [#1]
>> Exsw1600
>> Modules linked in:
>> NIP: c00bbb20 LR: c00bbb20 CTR: 00000000
>>
>
> Could you load this into gdb for me and show me the output of:
>
> gdb> li *0xc00bbb20
>
> (Assuming you compiled with debuginfo on)
>
> and...
>
> gdb> disass 0xc00bbb20
>
>
>> REGS: dd5b1c50 TRAP: 0700 Not tainted (2.6.27-rc2)
>> MSR: 00029000 <EE,ME> CR: 24082282 XER: 20000000
>> TASK = ddcce060[1707] 'find' THREAD: dd5b0000
>> GPR00: 00000000 dd5b1d00 ddcce060 dd801180 dd5b1d68 dd5b1d58 dd80125b 100234ec
>> GPR08: c0800000 00019330 0000ffff dd5b1d20 24000288 100ad874 100936f8 1008a1d0
>> GPR16: 10083f80 dd5b1e2c dd5b1d68 fffffff4 c0380000 dd5b1d60 dd5b1d58 dd802084
>> GPR24: dc3d7700 dd802018 dd5b1d68 c0380000 dd801180 dd5b1d68 00000000 dd5b1d00
>> NIP [c00bbb20] d_lookup+0x40/0x90
>> LR [c00bbb20] d_lookup+0x40/0x90
>> Call Trace:
>> [dd5b1d00] [dd5b1d58] 0xdd5b1d58 (unreliable)
>> [dd5b1d20] [c00aebc4] do_lookup+0xe8/0x220
>> [dd5b1d50] [c00b0a80] __link_path_walk+0x5a4/0xd54
>> [dd5b1dc0] [c00b1288] path_walk+0x58/0xe0
>> [dd5b1df0] [c00b13f8] do_path_lookup+0x78/0x13c
>> [dd5b1e20] [c00b20f4] user_path_at+0x64/0xac
>> [dd5b1e90] [c00a9028] vfs_lstat_fd+0x34/0x74
>> [dd5b1ec0] [c00a90fc] vfs_lstat+0x30/0x48
>> [dd5b1ed0] [c00a9144] sys_lstat64+0x30/0x5c
>> [dd5b1f40] [c0010554] ret_from_syscall+0x0/0x3c
>> Instruction dump:
>> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0
>> 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b32a0
>>
>
> Ouch! we have a 00000000 instruction. I'm almost done with the new mcount
> record for PPC (I have it done for ppc64, I'm just porting it to 32). I'm
> thinking this new code may solve your issues too. I hate the daemon.
>
> -- Steve
>
>
>
I have attached ddd. Up on crashing it points on this line
struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
{
struct dentry * dentry = NULL;
unsigned long seq;
do {
seq = read_seqbegin(&rename_lock);
dentry = __d_lookup(parent, name); <==== ddd marker
if (dentry)
break;
} while (read_seqretry(&rename_lock, seq));
return dentry;
}
(gdb) bt
#0 d_lookup (parent=0xdd801180, name=0xd99b3d68) at fs/dcache.c:1361
#1 0xc00ae7c8 in real_lookup (nd=<value optimized out>, name=<value
optimized out>, parent=0xdd801180) at fs/namei.c:507
#2 do_lookup (nd=0xd99b3e28, name=0xd99b3d68, path=0xd99b3d58) at
fs/namei.c:825
#3 0xc00b0684 in __link_path_walk (name=0xddc23009 "", nd=0xd99b3e28)
at fs/namei.c:982
#4 0xc00b0e8c in link_path_walk (nd=<value optimized out>, name=<value
optimized out>) at fs/namei.c:570
#5 path_walk (name=0xddc23000 "/proc/mtd", nd=0xd99b3e28) at
fs/namei.c:1041
#6 0xc00b0ffc in do_path_lookup (dfd=<value optimized out>,
name=0xddc23000 "/proc/mtd", flags=<value optimized out>, nd=0xd99b3e28)
at fs/namei.c:1091
#7 0xc00b1cf8 in user_path_at (dfd=-100, name=<value optimized out>,
flags=0, path=0xd99b3e98) at fs/namei.c:1334
#8 0xc00a8c98 in vfs_lstat_fd (dfd=-578809472, name=0xd99b3d68 "",
stat=0xd99b3ed8) at fs/stat.c:83
#9 0xc00a8d6c in vfs_lstat (name=0xd99b3d68 "", stat=0xd99b3d58) at
fs/stat.c:93
#10 0xc00a8db4 in sys_lstat64 (filename=0xdd801180 "",
statbuf=0xbfe285d8) at fs/stat.c:381
#11 0xc0010554 in syscall_dotrace_cont ()
both cp & lr points to 0xC00BB724
(gdb) info registers
r0 0x0 0
r1 0xd99b3d00 3650829568
r2 0xddccf2e0 3721196256
r3 0xdd801180 3716157824
r4 0xd99b3d68 3650829672
r5 0xd99b3d58 3650829656
r6 0xdd822a5b 3716295259
r7 0x100234ec 268580076
r8 0xc0800000 3229614080
r9 0x19330 103216
r10 0xffff 65535
r11 0xd99b3d20 3650829600
r12 0x24000288 603980424
r13 0x100ad874 269146228
r14 0x100936f8 269039352
r15 0x1008a1d0 269001168
r16 0x10083f80 268976000
r17 0xd99b3e2c 3650829868
r18 0xd99b3d68 3650829672
r19 0xfffffff4 4294967284
r20 0xc0380000 3224895488
r21 0xd99b3d60 3650829664
r22 0xd99b3d58 3650829656
r23 0xdd802084 3716161668
r24 0xdc3fb280 3695161984
r25 0xdd802018 3716161560
r26 0xd99b3d68 3650829672
r27 0xc0380000 3224895488
r28 0xdd801180 3716157824
r29 0xd99b3d68 3650829672
r30 0x0 0
r31 0xd99b3d00 3650829568
pc 0xc00bb724 3221993252
cr 0x24082282 604512898
lr 0xc00bb724 3221993252
ctr 0x0 0
xer 0x20000000 536870912
(gdb) li *0xC00BB724
Line 1361 of "fs/dcache.c" starts at address 0xc00bb724 <d_lookup+64>
and ends at 0xc00bb728 <d_lookup+68>.
(gdb) disassemble 0xC00BB724
Dump of assembler code for function d_lookup:
0xc00bb6e4 <d_lookup+0>: mflr r0
0xc00bb6e8 <d_lookup+4>: stw r0,4(r1)
0xc00bb6ec <d_lookup+8>: nop
0xc00bb6f0 <d_lookup+12>: stwu r1,-32(r1)
0xc00bb6f4 <d_lookup+16>: mflr r0
0xc00bb6f8 <d_lookup+20>: stmw r27,12(r1)
0xc00bb6fc <d_lookup+24>: lis r27,-16328
0xc00bb700 <d_lookup+28>: mr r31,r1
0xc00bb704 <d_lookup+32>: stw r0,36(r1)
0xc00bb708 <d_lookup+36>: mr r28,r3
0xc00bb70c <d_lookup+40>: mr r29,r4
0xc00bb710 <d_lookup+44>: lwz r30,12960(r27)
0xc00bb714 <d_lookup+48>: andi. r0,r30,1
0xc00bb718 <d_lookup+52>: mr r3,r28
0xc00bb71c <d_lookup+56>: mr r4,r29
0xc00bb720 <d_lookup+60>: bnela- 0x2c
0xc00bb724 <d_lookup+64>: .long 0x0
0xc00bb728 <d_lookup+68>: cmpwi cr7,r3,0
0xc00bb72c <d_lookup+72>: bne- cr7,0xc00bb75c <d_lookup+120>
0xc00bb730 <d_lookup+76>: lwz r0,12960(r27)
0xc00bb734 <d_lookup+80>: cmpw cr7,r0,r30
0xc00bb738 <d_lookup+84>: beq- cr7,0xc00bb75c <d_lookup+120>
0xc00bb73c <d_lookup+88>: mr r30,r0
0xc00bb740 <d_lookup+92>: andi. r0,r30,1
0xc00bb744 <d_lookup+96>: mr r3,r28
0xc00bb748 <d_lookup+100>: mr r4,r29
0xc00bb74c <d_lookup+104>: beq+ 0xc00bb724 <d_lookup+64>
0xc00bb750 <d_lookup+108>: lwz r0,12960(r27)
0xc00bb754 <d_lookup+112>: mr r30,r0
0xc00bb758 <d_lookup+116>: b 0xc00bb740 <d_lookup+92>
0xc00bb75c <d_lookup+120>: lwz r11,0(r1)
0xc00bb760 <d_lookup+124>: lwz r0,4(r11)
0xc00bb764 <d_lookup+128>: lmw r27,-20(r11)
0xc00bb768 <d_lookup+132>: mtlr r0
0xc00bb76c <d_lookup+136>: mr r1,r11
0xc00bb770 <d_lookup+140>: blr
End of assembler dump.
(gdb)
Hope this is helpfull
Liberty
* Eran Liberty ([email protected]) wrote:
> Mathieu Desnoyers wrote:
>> Can you check if, at some point during the system execution (starting
>> from boot), 0xdd5b1d58 is an address where a module is loaded ? (the
>> module can be later unloaded, what I wonder is if this address would
>> appear to have had a loaded+unloaded module).
>>
>> Actually, could you try to compile your kernel without "MODULE_UNLOAD" ?
>>
>> Mathieu
>>
> No modules...
>
> ~ # uname -a
> Linux 2.6.27-rc2 #6 Tue Aug 19 12:33:34 IDT 2008 ppc unknown
> ~ # gunzip /proc/config.gz -c | grep UNLOAD
> # CONFIG_MODULE_UNLOAD is not set
> ~ # gunzip /proc/config.gz -c | grep FTRACE
> CONFIG_HAVE_FTRACE=y
> CONFIG_HAVE_DYNAMIC_FTRACE=y
> CONFIG_FTRACE=y
> CONFIG_DYNAMIC_FTRACE=y
> CONFIG_FTRACE_SELFTEST=y
> CONFIG_FTRACE_STARTUP_TEST=y
> ~ # gunzip /proc/config.gz -c | grep MARKERS
> CONFIG_MARKERS=y
> ~ # lsmod
> Module Size Used by
> ~ # while [ 1 ] ; do find / > /dev/null ; echo . ; done
> .
> Oops: Exception in kernel mode, sig: 11 [#1]
> Exsw1600
> Modules linked in:
> NIP: c00bb724 LR: c00bb724 CTR: 00000000
> REGS: d8b59c50 TRAP: 0700 Not tainted (2.6.27-rc2)
> MSR: 00029000 <EE,ME> CR: 24082282 XER: 20000000
> TASK = dbc68de0[1712] 'find' THREAD: d8b58000
> GPR00: 00000000 d8b59d00 dbc68de0 dd801180 d8b59d68 d8b59d58 dd8019db
> 100234ec
> GPR08: c0800000 00019330 0000ffff d8b59d20 24000288 100ad874 100936f8
> 1008a1d0
> GPR16: 10083f80 d8b59e2c d8b59d68 fffffff4 c0380000 d8b59d60 d8b59d58
> dd802084
Register GPR30 is always zero... (also true for the other oops)
> GPR24: ddc69500 dd802018 d8b59d68 c0380000 dd801180 d8b59d68 00000000
> d8b59d00
> NIP [c00bb724] d_lookup+0x40/0x90
> LR [c00bb724] d_lookup+0x40/0x90
> Call Trace:
> [d8b59d00] [d8b59d58] 0xd8b59d58 (unreliable)
> [d8b59d20] [c00ae7c8] do_lookup+0xe8/0x220
> [d8b59d50] [c00b0684] __link_path_walk+0x5a4/0xd54
> [d8b59dc0] [c00b0e8c] path_walk+0x58/0xe0
> [d8b59df0] [c00b0ffc] do_path_lookup+0x78/0x13c
> [d8b59e20] [c00b1cf8] user_path_at+0x64/0xac
> [d8b59e90] [c00a8c98] vfs_lstat_fd+0x34/0x74
> [d8b59ec0] [c00a8d6c] vfs_lstat+0x30/0x48
> [d8b59ed0] [c00a8db4] sys_lstat64+0x30/0x5c
> [d8b59f40] [c0010554] ret_from_syscall+0x0/0x3c
> Instruction dump:
> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0
Just like this instruction... maybe related ?
Mathieu
> 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b32a0
> ---[ end trace 7766edd310cd3442 ]---
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Eran Liberty ([email protected]) wrote:
> Steven Rostedt wrote:
>> On Mon, 18 Aug 2008, Eran Liberty wrote:
>>
>>
>>> Steven Rostedt wrote:
>>>
>>>> Eran Liberty wrote:
>>>>
>>>>> After compiling a kernel with ftrace I started to experience all sorts
>>>>> of
>>>>> crashes.
>>>>>
>>>> Just to make sure...
>>>>
>>>> ftrace enables markers too, and RCU has tracing with the markers. This
>>>> may
>>>> not be the problem, but I just want to eliminate as many variables as
>>>> possible.
>>>> Could you disable ftrace, but keep the markers on too. Also, could you
>>>> enable ftrace again and turn on the FTRACE_STARTUP_TEST.
>>>>
>>> for the fun of it I took out all my propriety modules; so now its a non
>>> tainted kernel.
>>>
>>> Here is the matrix:
>>>
>>> !FTRACE x !MARKERS => stable
>>> !FTRACE x MARKERS => stable
>>> FTRACE x !MARKERS => n/a (FTRACE forces MARKERS)
>>> FTRACE x MARKERS => unstable
>>> FTRACE x FTRACE_STARTUP_TEST x MARKERS => unstable + tests passed
>>>
>>
>> Thanks
>>
>>
>>> Testing tracer sched_switch: PASSED
>>> Testing tracer ftrace: PASSED
>>> Testing dynamic ftrace: PASSED
>>>
>>> Oops: Exception in kernel mode, sig: 11 [#1]
>>> Exsw1600
>>> Modules linked in:
>>> NIP: c00bbb20 LR: c00bbb20 CTR: 00000000
>>>
>>
>> Could you load this into gdb for me and show me the output of:
>>
>> gdb> li *0xc00bbb20
>>
>> (Assuming you compiled with debuginfo on)
>>
>> and...
>>
>> gdb> disass 0xc00bbb20
>>
>>
>>> REGS: dd5b1c50 TRAP: 0700 Not tainted (2.6.27-rc2)
>>> MSR: 00029000 <EE,ME> CR: 24082282 XER: 20000000
>>> TASK = ddcce060[1707] 'find' THREAD: dd5b0000
>>> GPR00: 00000000 dd5b1d00 ddcce060 dd801180 dd5b1d68 dd5b1d58 dd80125b
>>> 100234ec
>>> GPR08: c0800000 00019330 0000ffff dd5b1d20 24000288 100ad874 100936f8
>>> 1008a1d0
>>> GPR16: 10083f80 dd5b1e2c dd5b1d68 fffffff4 c0380000 dd5b1d60 dd5b1d58
>>> dd802084
>>> GPR24: dc3d7700 dd802018 dd5b1d68 c0380000 dd801180 dd5b1d68 00000000
>>> dd5b1d00
>>> NIP [c00bbb20] d_lookup+0x40/0x90
>>> LR [c00bbb20] d_lookup+0x40/0x90
>>> Call Trace:
>>> [dd5b1d00] [dd5b1d58] 0xdd5b1d58 (unreliable)
>>> [dd5b1d20] [c00aebc4] do_lookup+0xe8/0x220
>>> [dd5b1d50] [c00b0a80] __link_path_walk+0x5a4/0xd54
>>> [dd5b1dc0] [c00b1288] path_walk+0x58/0xe0
>>> [dd5b1df0] [c00b13f8] do_path_lookup+0x78/0x13c
>>> [dd5b1e20] [c00b20f4] user_path_at+0x64/0xac
>>> [dd5b1e90] [c00a9028] vfs_lstat_fd+0x34/0x74
>>> [dd5b1ec0] [c00a90fc] vfs_lstat+0x30/0x48
>>> [dd5b1ed0] [c00a9144] sys_lstat64+0x30/0x5c
>>> [dd5b1f40] [c0010554] ret_from_syscall+0x0/0x3c
>>> Instruction dump:
>>> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0
>>> 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b32a0
>>>
>>
>> Ouch! we have a 00000000 instruction. I'm almost done with the new mcount
>> record for PPC (I have it done for ppc64, I'm just porting it to 32). I'm
>> thinking this new code may solve your issues too. I hate the daemon.
>>
>> -- Steve
>>
>>
>>
>
> I have attached ddd. Up on crashing it points on this line
>
> struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
> {
> struct dentry * dentry = NULL;
> unsigned long seq;
> do {
> seq = read_seqbegin(&rename_lock);
> dentry = __d_lookup(parent, name); <==== ddd marker
> if (dentry)
> break;
> } while (read_seqretry(&rename_lock, seq));
> return dentry;
> }
>
> (gdb) bt
> #0 d_lookup (parent=0xdd801180, name=0xd99b3d68) at fs/dcache.c:1361
> #1 0xc00ae7c8 in real_lookup (nd=<value optimized out>, name=<value
> optimized out>, parent=0xdd801180) at fs/namei.c:507
> #2 do_lookup (nd=0xd99b3e28, name=0xd99b3d68, path=0xd99b3d58) at
> fs/namei.c:825
> #3 0xc00b0684 in __link_path_walk (name=0xddc23009 "", nd=0xd99b3e28) at
> fs/namei.c:982
> #4 0xc00b0e8c in link_path_walk (nd=<value optimized out>, name=<value
> optimized out>) at fs/namei.c:570
> #5 path_walk (name=0xddc23000 "/proc/mtd", nd=0xd99b3e28) at
> fs/namei.c:1041
> #6 0xc00b0ffc in do_path_lookup (dfd=<value optimized out>,
> name=0xddc23000 "/proc/mtd", flags=<value optimized out>, nd=0xd99b3e28) at
> fs/namei.c:1091
> #7 0xc00b1cf8 in user_path_at (dfd=-100, name=<value optimized out>,
> flags=0, path=0xd99b3e98) at fs/namei.c:1334
> #8 0xc00a8c98 in vfs_lstat_fd (dfd=-578809472, name=0xd99b3d68 "",
> stat=0xd99b3ed8) at fs/stat.c:83
> #9 0xc00a8d6c in vfs_lstat (name=0xd99b3d68 "", stat=0xd99b3d58) at
> fs/stat.c:93
> #10 0xc00a8db4 in sys_lstat64 (filename=0xdd801180 "", statbuf=0xbfe285d8)
> at fs/stat.c:381
> #11 0xc0010554 in syscall_dotrace_cont ()
>
> both cp & lr points to 0xC00BB724
> (gdb) info registers
> r0 0x0 0
> r1 0xd99b3d00 3650829568
> r2 0xddccf2e0 3721196256
> r3 0xdd801180 3716157824
> r4 0xd99b3d68 3650829672
> r5 0xd99b3d58 3650829656
> r6 0xdd822a5b 3716295259
> r7 0x100234ec 268580076
> r8 0xc0800000 3229614080
> r9 0x19330 103216
> r10 0xffff 65535
> r11 0xd99b3d20 3650829600
> r12 0x24000288 603980424
> r13 0x100ad874 269146228
> r14 0x100936f8 269039352
> r15 0x1008a1d0 269001168
> r16 0x10083f80 268976000
> r17 0xd99b3e2c 3650829868
> r18 0xd99b3d68 3650829672
> r19 0xfffffff4 4294967284
> r20 0xc0380000 3224895488
> r21 0xd99b3d60 3650829664
> r22 0xd99b3d58 3650829656
> r23 0xdd802084 3716161668
> r24 0xdc3fb280 3695161984
> r25 0xdd802018 3716161560
> r26 0xd99b3d68 3650829672
> r27 0xc0380000 3224895488
> r28 0xdd801180 3716157824
> r29 0xd99b3d68 3650829672
> r30 0x0 0
> r31 0xd99b3d00 3650829568
> pc 0xc00bb724 3221993252
> cr 0x24082282 604512898
> lr 0xc00bb724 3221993252
> ctr 0x0 0
> xer 0x20000000 536870912
>
> (gdb) li *0xC00BB724
> Line 1361 of "fs/dcache.c" starts at address 0xc00bb724 <d_lookup+64> and
> ends at 0xc00bb728 <d_lookup+68>.
>
> (gdb) disassemble 0xC00BB724
> Dump of assembler code for function d_lookup:
> 0xc00bb6e4 <d_lookup+0>: mflr r0
> 0xc00bb6e8 <d_lookup+4>: stw r0,4(r1)
> 0xc00bb6ec <d_lookup+8>: nop
> 0xc00bb6f0 <d_lookup+12>: stwu r1,-32(r1)
> 0xc00bb6f4 <d_lookup+16>: mflr r0
> 0xc00bb6f8 <d_lookup+20>: stmw r27,12(r1)
> 0xc00bb6fc <d_lookup+24>: lis r27,-16328
> 0xc00bb700 <d_lookup+28>: mr r31,r1
> 0xc00bb704 <d_lookup+32>: stw r0,36(r1)
> 0xc00bb708 <d_lookup+36>: mr r28,r3
> 0xc00bb70c <d_lookup+40>: mr r29,r4
> 0xc00bb710 <d_lookup+44>: lwz r30,12960(r27)
> 0xc00bb714 <d_lookup+48>: andi. r0,r30,1
> 0xc00bb718 <d_lookup+52>: mr r3,r28
> 0xc00bb71c <d_lookup+56>: mr r4,r29
> 0xc00bb720 <d_lookup+60>: bnela- 0x2c
> 0xc00bb724 <d_lookup+64>: .long 0x0
> 0xc00bb728 <d_lookup+68>: cmpwi cr7,r3,0
> 0xc00bb72c <d_lookup+72>: bne- cr7,0xc00bb75c <d_lookup+120>
> 0xc00bb730 <d_lookup+76>: lwz r0,12960(r27)
> 0xc00bb734 <d_lookup+80>: cmpw cr7,r0,r30
> 0xc00bb738 <d_lookup+84>: beq- cr7,0xc00bb75c <d_lookup+120>
> 0xc00bb73c <d_lookup+88>: mr r30,r0
> 0xc00bb740 <d_lookup+92>: andi. r0,r30,1
> 0xc00bb744 <d_lookup+96>: mr r3,r28
> 0xc00bb748 <d_lookup+100>: mr r4,r29
> 0xc00bb74c <d_lookup+104>: beq+ 0xc00bb724 <d_lookup+64>
> 0xc00bb750 <d_lookup+108>: lwz r0,12960(r27)
> 0xc00bb754 <d_lookup+112>: mr r30,r0
> 0xc00bb758 <d_lookup+116>: b 0xc00bb740 <d_lookup+92>
> 0xc00bb75c <d_lookup+120>: lwz r11,0(r1)
> 0xc00bb760 <d_lookup+124>: lwz r0,4(r11)
> 0xc00bb764 <d_lookup+128>: lmw r27,-20(r11)
> 0xc00bb768 <d_lookup+132>: mtlr r0
> 0xc00bb76c <d_lookup+136>: mr r1,r11
> 0xc00bb770 <d_lookup+140>: blr
> End of assembler dump.
> (gdb)
>
> Hope this is helpfull
>
Can you also give us
objdump -S --start-address=0xC00BB724 vmlinux | head 20
?
Then we could compare the result with the OOPS instruction dump :
7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0
73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b32a0
Mathieu
> Liberty
>
>
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Mathieu Desnoyers wrote:
> Can you also give us
>
> objdump -S --start-address=0xC00BB724 vmlinux | head 20
>
> ?
>
> Then we could compare the result with the OOPS instruction dump :
>
> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0
> 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b32a0
>
> Mathieu
>
>
to give you more context I have run:
> powerpc-linux-gnu-objdump -S --start-address=0xC00BB700 vmlinux |
head -n 60
the discrepancy starts at address:
c00bb720: 40 82 00 30 <=> 40 82 00 2f
c00bb724: 4b ff fe 61 <=> 00 00 00 00
vmlinux: file format elf32-powerpc
Disassembly of section .text:
c00bb700 <d_lookup+0x1c>:
* d_lookup() is protected against the concurrent renames in some unrelated
* directory using the seqlockt_t rename_lock.
*/
struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
{
c00bb700: 7c 3f 0b 78 mr r31,r1
c00bb704: 90 01 00 24 stw r0,36(r1)
c00bb708: 7c 7c 1b 78 mr r28,r3
c00bb70c: 7c 9d 23 78 mr r29,r4
c00bb710: 83 db 32 a0 lwz r30,12960(r27)
{
unsigned ret;
repeat:
ret = sl->sequence;
smp_rmb();
c00bb714: 73 c0 00 01 andi. r0,r30,1
struct dentry * dentry = NULL;
unsigned long seq;
do {
seq = read_seqbegin(&rename_lock);
dentry = __d_lookup(parent, name);
c00bb718: 7f 83 e3 78 mr r3,r28
c00bb71c: 7f a4 eb 78 mr r4,r29
if (unlikely(ret & 1)) {
c00bb720: 40 82 00 30 bne- c00bb750 <d_lookup+0x6c>
c00bb724: 4b ff fe 61 bl c00bb584 <__d_lookup>
if (dentry)
c00bb728: 2f 83 00 00 cmpwi cr7,r3,0
c00bb72c: 40 9e 00 30 bne- cr7,c00bb75c <d_lookup+0x78>
*
* If sequence value changed then writer changed data while in section.
*/
static __always_inline int read_seqretry(const seqlock_t *sl, unsigned
start)
{
smp_rmb();
c00bb730: 80 1b 32 a0 lwz r0,12960(r27)
break;
} while (read_seqretry(&rename_lock, seq));
c00bb734: 7f 80 f0 00 cmpw cr7,r0,r30
c00bb738: 41 9e 00 24 beq- cr7,c00bb75c <d_lookup+0x78>
/* Start of read calculation -- fetch last complete writer token */
static __always_inline unsigned read_seqbegin(const seqlock_t *sl)
{
unsigned ret;
repeat:
c00bb73c: 7c 1e 03 78 mr r30,r0
ret = sl->sequence;
smp_rmb();
c00bb740: 73 c0 00 01 andi. r0,r30,1
struct dentry * dentry = NULL;
* Eran Liberty ([email protected]) wrote:
> Mathieu Desnoyers wrote:
>> Can you also give us
>>
>> objdump -S --start-address=0xC00BB724 vmlinux | head 20
>>
>> ?
>>
>> Then we could compare the result with the OOPS instruction dump :
>>
>> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0
>> 73c00001 7f83e378 7fa4eb78 4082002f <00000000> 2f830000 409e0030 801b32a0
>>
>> Mathieu
>>
>>
>
> to give you more context I have run:
> > powerpc-linux-gnu-objdump -S --start-address=0xC00BB700 vmlinux | head -n
> 60
>
> the discrepancy starts at address:
> c00bb720: 40 82 00 30 <=> 40 82 00 2f
> c00bb724: 4b ff fe 61 <=> 00 00 00 00
>
Ah !
I think I see what could be wrong :
First we have :
static unsigned int ftrace_nop = 0x60000000;
We probably replace the original function call by this nop.
Then we do :
notrace unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr)
{
static unsigned int op;
/*
* It would be nice to just use create_function_call, but that will
* update the code itself. Here we need to just return the
* instruction that is going to be modified, without modifying the
* code.
*/
addr = GET_ADDR(addr);
/* Set to "bl addr" */
op = 0x48000001 | (ftrace_calc_offset(ip, addr) & 0x03fffffc);
/*
* No locking needed, this must be called via kstop_machine
* which in essence is like running on a uniprocessor machine.
*/
return (unsigned char *)&op;
}
And I guess we must be doing a 0x48000001 | 0x0; or something ?
Also, we have to consider that POWERPC 64 functions are :
/* PowerPC64's functions are data that points to the functions */
And this does not seem to hold for ppc32. Therefore, it is strange to me
that the same code is used for the update.. are we updating the correct
site ?
Mathieu
> vmlinux: file format elf32-powerpc
>
> Disassembly of section .text:
>
> c00bb700 <d_lookup+0x1c>:
> * d_lookup() is protected against the concurrent renames in some unrelated
> * directory using the seqlockt_t rename_lock.
> */
>
> struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
> {
> c00bb700: 7c 3f 0b 78 mr r31,r1
> c00bb704: 90 01 00 24 stw r0,36(r1)
> c00bb708: 7c 7c 1b 78 mr r28,r3
> c00bb70c: 7c 9d 23 78 mr r29,r4
> c00bb710: 83 db 32 a0 lwz r30,12960(r27)
> {
> unsigned ret;
>
> repeat:
> ret = sl->sequence;
> smp_rmb();
> c00bb714: 73 c0 00 01 andi. r0,r30,1
> struct dentry * dentry = NULL;
> unsigned long seq;
>
> do {
> seq = read_seqbegin(&rename_lock);
> dentry = __d_lookup(parent, name);
> c00bb718: 7f 83 e3 78 mr r3,r28
> c00bb71c: 7f a4 eb 78 mr r4,r29
> if (unlikely(ret & 1)) {
> c00bb720: 40 82 00 30 bne- c00bb750 <d_lookup+0x6c>
> c00bb724: 4b ff fe 61 bl c00bb584 <__d_lookup>
> if (dentry)
> c00bb728: 2f 83 00 00 cmpwi cr7,r3,0
> c00bb72c: 40 9e 00 30 bne- cr7,c00bb75c <d_lookup+0x78>
> *
> * If sequence value changed then writer changed data while in section.
> */
> static __always_inline int read_seqretry(const seqlock_t *sl, unsigned
> start)
> {
> smp_rmb();
> c00bb730: 80 1b 32 a0 lwz r0,12960(r27)
> break;
> } while (read_seqretry(&rename_lock, seq));
> c00bb734: 7f 80 f0 00 cmpw cr7,r0,r30
> c00bb738: 41 9e 00 24 beq- cr7,c00bb75c <d_lookup+0x78>
> /* Start of read calculation -- fetch last complete writer token */
> static __always_inline unsigned read_seqbegin(const seqlock_t *sl)
> {
> unsigned ret;
>
> repeat:
> c00bb73c: 7c 1e 03 78 mr r30,r0
> ret = sl->sequence;
> smp_rmb();
> c00bb740: 73 c0 00 01 andi. r0,r30,1
> struct dentry * dentry = NULL;
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Mon, 18 Aug 2008, Mathieu Desnoyers wrote:
> * Steven Rostedt ([email protected]) wrote:
> >
> > On Tue, 19 Aug 2008, Benjamin Herrenschmidt wrote:
> >
> > >
> > > > Hmm, this was originally copied from x86, where we did a cmpxchg, but that
> > > > is probably not needed since all of this is done in kstop_machine. Also,
> > > > only the "get" is needed. If we don't fault there, we wont fault on the
> > > > put (unless we have permissions wrong, and that would be a bug).
> > >
> > > Would it ? How do we make sure the kernel text is mapped writeable ?
> >
> > We map it writeable if FTRACE is enabled.
> >
>
> Argh. See text_poke(). It's there exactly for this purpose on x86.
>
OK, I just tried text_poke and it unfortunately fails. The problem is that
it requires that the text you are changing is aligned and fits on one
page. We have no control over that.
-- Steve
* Steven Rostedt ([email protected]) wrote:
>
>
> On Mon, 18 Aug 2008, Mathieu Desnoyers wrote:
>
> > * Steven Rostedt ([email protected]) wrote:
> > >
> > > On Tue, 19 Aug 2008, Benjamin Herrenschmidt wrote:
> > >
> > > >
> > > > > Hmm, this was originally copied from x86, where we did a cmpxchg, but that
> > > > > is probably not needed since all of this is done in kstop_machine. Also,
> > > > > only the "get" is needed. If we don't fault there, we wont fault on the
> > > > > put (unless we have permissions wrong, and that would be a bug).
> > > >
> > > > Would it ? How do we make sure the kernel text is mapped writeable ?
> > >
> > > We map it writeable if FTRACE is enabled.
> > >
> >
> > Argh. See text_poke(). It's there exactly for this purpose on x86.
> >
>
> OK, I just tried text_poke and it unfortunately fails. The problem is that
> it requires that the text you are changing is aligned and fits on one
> page. We have no control over that.
>
> -- Steve
>
Ok, there are two cases where it's ok :
1 - in stop_machine, considering we are not touching code executed in
NMI handlers.
2 - when using my replace_instruction_safe() which uses a temporary
breakpoint when doing the instruction replacement.
In those cases you could use text_poke_early().
See
http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=arch/x86/kernel/immediate.c;h=7789e2c75bf03e645f15759d5dff0c1698493f92;hb=HEAD
For a use example. Basically it looks like :
360 pages[0] = virt_to_page((void *)bypass_eip);
361 vaddr = vmap(pages, 1, VM_MAP, PAGE_KERNEL);
362 BUG_ON(!vaddr);
363 text_poke_early(&vaddr[bypass_eip & ~PAGE_MASK],
364 (void *)addr, size);
365 /*
366 * Fill the rest with nops.
367 */
368 len = NR_NOPS - size;
369 add_nops((void *)
370 &vaddr[(bypass_eip & ~PAGE_MASK) + size],
371 len);
372 print_dbg_bytes("inserted nops",
373 &vaddr[(bypass_eip & ~PAGE_MASK) + size], len);
374 vunmap(vaddr);
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Tue, 19 Aug 2008, Eran Liberty wrote:
> Steven Rostedt wrote:
> >
> > > Testing tracer sched_switch: PASSED
> > > Testing tracer ftrace: PASSED
> > > Testing dynamic ftrace: PASSED
Do you have PREEMPT_TRACER enabled, or any other tracer for that matter?
-- Steve
> > >
> > > Oops: Exception in kernel mode, sig: 11 [#1]
> > > Exsw1600
> > > Modules linked in:
> > > NIP: c00bbb20 LR: c00bbb20 CTR: 00000000
> > >
> >
On Tue, 19 Aug 2008, Mathieu Desnoyers wrote:
>
> Ok, there are two cases where it's ok :
>
> 1 - in stop_machine, considering we are not touching code executed in
> NMI handlers.
> 2 - when using my replace_instruction_safe() which uses a temporary
> breakpoint when doing the instruction replacement.
>
> In those cases you could use text_poke_early().
>
> See
> http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=arch/x86/kernel/immediate.c;h=7789e2c75bf03e645f15759d5dff0c1698493f92;hb=HEAD
>
> For a use example. Basically it looks like :
>
>
> 360 pages[0] = virt_to_page((void *)bypass_eip);
> 361 vaddr = vmap(pages, 1, VM_MAP, PAGE_KERNEL);
> 362 BUG_ON(!vaddr);
> 363 text_poke_early(&vaddr[bypass_eip & ~PAGE_MASK],
> 364 (void *)addr, size);
> 365 /*
> 366 * Fill the rest with nops.
> 367 */
> 368 len = NR_NOPS - size;
> 369 add_nops((void *)
> 370 &vaddr[(bypass_eip & ~PAGE_MASK) + size],
> 371 len);
> 372 print_dbg_bytes("inserted nops",
> 373 &vaddr[(bypass_eip & ~PAGE_MASK) + size], len);
> 374 vunmap(vaddr);
vunmap can not be called with interrupts disabled, and this is exactly
what my code does.
-- Steve
> Register GPR30 is always zero... (also true for the other oops)
>From my experiments, -a- non volatile register gets corrupted. Not
always the same and not always with the same value.
The corruption -seems- to happen due to corruption of the location on
the stack where it was saved to / restored from in a previous function
call or interrupt.
I haven't yet tracked down what actually whacks the stack.
Ben.
>
> Ok, there are two cases where it's ok :
>
> 1 - in stop_machine, considering we are not touching code executed in
> NMI handlers.
> 2 - when using my replace_instruction_safe() which uses a temporary
> breakpoint when doing the instruction replacement.
>
> In those cases you could use text_poke_early().
Note that vmap/vunmap will be very slow.
Ben.
Benjamin Herrenschmidt wrote:
>> Ok, there are two cases where it's ok :
>>
>> 1 - in stop_machine, considering we are not touching code executed in
>> NMI handlers.
>> 2 - when using my replace_instruction_safe() which uses a temporary
>> breakpoint when doing the instruction replacement.
>>
>> In those cases you could use text_poke_early().
>>
>
> Note that vmap/vunmap will be very slow.
>
Don't we have Nick's speedups now?
J
On Tue, 2008-08-19 at 16:58 -0700, Jeremy Fitzhardinge wrote:
> Benjamin Herrenschmidt wrote:
> >> Ok, there are two cases where it's ok :
> >>
> >> 1 - in stop_machine, considering we are not touching code executed in
> >> NMI handlers.
> >> 2 - when using my replace_instruction_safe() which uses a temporary
> >> breakpoint when doing the instruction replacement.
> >>
> >> In those cases you could use text_poke_early().
> >>
> >
> > Note that vmap/vunmap will be very slow.
> >
>
> Don't we have Nick's speedups now?
Not sure what speedups this is, but it's stlil not something you want to
do a lot ... setting up and tearing down MMU mappings isn't something
I'd like to see happening on a per-mcount basis.
Ben
Found the problem (or at least -a- problem), it's a gcc bug.
Well, first I must say the code generated by -pg is just plain
horrible :-)
Appart from that, look at the exit of, for example, __d_lookup, as
generated by gcc when ftrace is enabled:
c00c0498: 38 60 00 00 li r3,0
c00c049c: 81 61 00 00 lwz r11,0(r1)
c00c04a0: 80 0b 00 04 lwz r0,4(r11)
c00c04a4: 7d 61 5b 78 mr r1,r11
c00c04a8: bb 0b ff e0 lmw r24,-32(r11)
c00c04ac: 7c 08 03 a6 mtlr r0
c00c04b0: 4e 80 00 20 blr
As you can see, it restores r1 -before- it pops r24..r31 off
the stack ! I let you imagine what happens if an interrupt happens
just in between those two instructions (mr and lmw). We don't do
redzones on our ABI, so basically, the registers end up corrupted
by the interrupt.
Cheers,
Ben.
On Wednesday 20 August 2008 07:08, Steven Rostedt wrote:
> On Tue, 19 Aug 2008, Mathieu Desnoyers wrote:
> > Ok, there are two cases where it's ok :
> >
> > 1 - in stop_machine, considering we are not touching code executed in
> > NMI handlers.
> > 2 - when using my replace_instruction_safe() which uses a temporary
> > breakpoint when doing the instruction replacement.
> >
> > In those cases you could use text_poke_early().
> >
> > See
> > http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=b
> >lob;f=arch/x86/kernel/immediate.c;h=7789e2c75bf03e645f15759d5dff0c1698493f
> >92;hb=HEAD
> >
> > For a use example. Basically it looks like :
> >
> >
> > 360 pages[0] = virt_to_page((void *)bypass_eip);
> > 361 vaddr = vmap(pages, 1, VM_MAP, PAGE_KERNEL);
> > 362 BUG_ON(!vaddr);
> > 363 text_poke_early(&vaddr[bypass_eip & ~PAGE_MASK],
> > 364 (void *)addr, size);
> > 365 /*
> > 366 * Fill the rest with nops.
> > 367 */
> > 368 len = NR_NOPS - size;
> > 369 add_nops((void *)
> > 370 &vaddr[(bypass_eip & ~PAGE_MASK) + size],
> > 371 len);
> > 372 print_dbg_bytes("inserted nops",
> > 373 &vaddr[(bypass_eip & ~PAGE_MASK) + size],
> > len); 374 vunmap(vaddr);
>
> vunmap can not be called with interrupts disabled, and this is exactly
> what my code does.
It could be after the vmap rewrite (with a few other small tweaks).
But a) it would be less robust when called from interrupt context
and this code looks broken as it is WRT error handling; and b) it
still costs several thousand cycles to vmap+touch+vunmap...
Steven Rostedt wrote:
> On Tue, 19 Aug 2008, Eran Liberty wrote:
>
>
>> Steven Rostedt wrote:
>>
>>>
>>>
>>>> Testing tracer sched_switch: PASSED
>>>> Testing tracer ftrace: PASSED
>>>> Testing dynamic ftrace: PASSED
>>>>
>
> Do you have PREEMPT_TRACER enabled, or any other tracer for that matter?
>
> -- Steve
>
I can see stack trace & context trace, but they are derived from other
choices (I can not un select them)
cat .config | grep TRACE
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
# CONFIG_BLK_DEV_IO_TRACE is not set
CONFIG_STACKTRACE=y
# CONFIG_BACKTRACE_SELF_TEST is not set
CONFIG_HAVE_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_FTRACE=y
# CONFIG_SCHED_TRACER is not set
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE_SELFTEST=y
CONFIG_FTRACE_STARTUP_TEST=y
>
>>>> Oops: Exception in kernel mode, sig: 11 [#1]
>>>> Exsw1600
>>>> Modules linked in:
>>>> NIP: c00bbb20 LR: c00bbb20 CTR: 00000000
>>>>
>>>>
>
>
>
On Wed, 20 Aug 2008, Eran Liberty wrote:
> Steven Rostedt wrote:
> > On Tue, 19 Aug 2008, Eran Liberty wrote:
> >
> >
> > > Steven Rostedt wrote:
> > >
> > > >
> > > > > Testing tracer sched_switch: PASSED
> > > > > Testing tracer ftrace: PASSED
> > > > > Testing dynamic ftrace: PASSED
> > > > >
> >
> > Do you have PREEMPT_TRACER enabled, or any other tracer for that matter?
> >
> > -- Steve
> >
> I can see stack trace & context trace, but they are derived from other choices
> (I can not un select them)
Yeah, those are not bad.
>
> cat .config | grep TRACE
> CONFIG_STACKTRACE_SUPPORT=y
> CONFIG_HAVE_ARCH_TRACEHOOK=y
> # CONFIG_BLK_DEV_IO_TRACE is not set
> CONFIG_STACKTRACE=y
> # CONFIG_BACKTRACE_SELF_TEST is not set
> CONFIG_HAVE_FTRACE=y
> CONFIG_HAVE_DYNAMIC_FTRACE=y
> CONFIG_FTRACE=y
> # CONFIG_SCHED_TRACER is not set
> CONFIG_CONTEXT_SWITCH_TRACER=y
> CONFIG_DYNAMIC_FTRACE=y
> CONFIG_FTRACE_SELFTEST=y
> CONFIG_FTRACE_STARTUP_TEST=y
You must not have PREEMPT on, so the PREEMPT_TRACER will not show up.
The reason I asked, is that my PowerBook runs fine without PREEMPT_TRACER
but is very unstable when I have PREEMPT_TRACER enabled.
-- Steve
> >
> > > > > Oops: Exception in kernel mode, sig: 11 [#1]
> > > > > Exsw1600
> > > > > Modules linked in:
> > > > > NIP: c00bbb20 LR: c00bbb20 CTR: 00000000
> > > > >
> >
> >
> >
>
>
On Wed, 20 Aug 2008, Benjamin Herrenschmidt wrote:
> Found the problem (or at least -a- problem), it's a gcc bug.
>
> Well, first I must say the code generated by -pg is just plain
> horrible :-)
>
> Appart from that, look at the exit of, for example, __d_lookup, as
> generated by gcc when ftrace is enabled:
>
> c00c0498: 38 60 00 00 li r3,0
> c00c049c: 81 61 00 00 lwz r11,0(r1)
> c00c04a0: 80 0b 00 04 lwz r0,4(r11)
> c00c04a4: 7d 61 5b 78 mr r1,r11
> c00c04a8: bb 0b ff e0 lmw r24,-32(r11)
> c00c04ac: 7c 08 03 a6 mtlr r0
> c00c04b0: 4e 80 00 20 blr
>
> As you can see, it restores r1 -before- it pops r24..r31 off
> the stack ! I let you imagine what happens if an interrupt happens
> just in between those two instructions (mr and lmw). We don't do
> redzones on our ABI, so basically, the registers end up corrupted
> by the interrupt.
Ouch! You've disassembled this without -pg too, and it does not have this
bug? What version of gcc do you have?
-- Steve
On Wed, 20 Aug 2008, Steven Rostedt wrote:
>
> On Wed, 20 Aug 2008, Benjamin Herrenschmidt wrote:
>
> > Found the problem (or at least -a- problem), it's a gcc bug.
> >
> > Well, first I must say the code generated by -pg is just plain
> > horrible :-)
> >
> > Appart from that, look at the exit of, for example, __d_lookup, as
> > generated by gcc when ftrace is enabled:
> >
> > c00c0498: 38 60 00 00 li r3,0
> > c00c049c: 81 61 00 00 lwz r11,0(r1)
> > c00c04a0: 80 0b 00 04 lwz r0,4(r11)
> > c00c04a4: 7d 61 5b 78 mr r1,r11
> > c00c04a8: bb 0b ff e0 lmw r24,-32(r11)
> > c00c04ac: 7c 08 03 a6 mtlr r0
> > c00c04b0: 4e 80 00 20 blr
> >
> > As you can see, it restores r1 -before- it pops r24..r31 off
> > the stack ! I let you imagine what happens if an interrupt happens
> > just in between those two instructions (mr and lmw). We don't do
> > redzones on our ABI, so basically, the registers end up corrupted
> > by the interrupt.
>
> Ouch! You've disassembled this without -pg too, and it does not have this
> bug? What version of gcc do you have?
>
I have:
gcc (Debian 4.3.1-2) 4.3.1
c00c64c8: 81 61 00 00 lwz r11,0(r1)
c00c64cc: 7f 83 e3 78 mr r3,r28
c00c64d0: 80 0b 00 04 lwz r0,4(r11)
c00c64d4: ba eb ff dc lmw r23,-36(r11)
c00c64d8: 7d 61 5b 78 mr r1,r11
c00c64dc: 7c 08 03 a6 mtlr r0
c00c64e0: 4e 80 00 20 blr
My version looks fine. I'm thinking that this is a separate issue than
what Eran is seeing.
Eran, can you do an "objdump -dr vmlinux" and search for __d_lookup, and
print out the end of the function dump.
Thanks,
-- Steve
Steven Rostedt wrote:
> On Wed, 20 Aug 2008, Steven Rostedt wrote:
>
>
>> On Wed, 20 Aug 2008, Benjamin Herrenschmidt wrote:
>>
>>
>>> Found the problem (or at least -a- problem), it's a gcc bug.
>>>
>>> Well, first I must say the code generated by -pg is just plain
>>> horrible :-)
>>>
>>> Appart from that, look at the exit of, for example, __d_lookup, as
>>> generated by gcc when ftrace is enabled:
>>>
>>> c00c0498: 38 60 00 00 li r3,0
>>> c00c049c: 81 61 00 00 lwz r11,0(r1)
>>> c00c04a0: 80 0b 00 04 lwz r0,4(r11)
>>> c00c04a4: 7d 61 5b 78 mr r1,r11
>>> c00c04a8: bb 0b ff e0 lmw r24,-32(r11)
>>> c00c04ac: 7c 08 03 a6 mtlr r0
>>> c00c04b0: 4e 80 00 20 blr
>>>
>>> As you can see, it restores r1 -before- it pops r24..r31 off
>>> the stack ! I let you imagine what happens if an interrupt happens
>>> just in between those two instructions (mr and lmw). We don't do
>>> redzones on our ABI, so basically, the registers end up corrupted
>>> by the interrupt.
>>>
>> Ouch! You've disassembled this without -pg too, and it does not have this
>> bug? What version of gcc do you have?
>>
>>
>
> I have:
> gcc (Debian 4.3.1-2) 4.3.1
>
> c00c64c8: 81 61 00 00 lwz r11,0(r1)
> c00c64cc: 7f 83 e3 78 mr r3,r28
> c00c64d0: 80 0b 00 04 lwz r0,4(r11)
> c00c64d4: ba eb ff dc lmw r23,-36(r11)
> c00c64d8: 7d 61 5b 78 mr r1,r11
> c00c64dc: 7c 08 03 a6 mtlr r0
> c00c64e0: 4e 80 00 20 blr
>
>
> My version looks fine. I'm thinking that this is a separate issue than
> what Eran is seeing.
>
> Eran, can you do an "objdump -dr vmlinux" and search for __d_lookup, and
> print out the end of the function dump.
>
> Thanks,
>
> -- Steve
>
>
>
>
powerpc-linux-gnu-objdump -dr --start-address=0xc00bb584 vmlinux | head
-n 100
vmlinux: file format elf32-powerpc
Disassembly of section .text:
c00bb584 <__d_lookup>:
c00bb584: 7c 08 02 a6 mflr r0
c00bb588: 90 01 00 04 stw r0,4(r1)
c00bb58c: 4b f5 5c 51 bl c00111dc <_mcount>
c00bb590: 94 21 ff d0 stwu r1,-48(r1)
c00bb594: 7c 08 02 a6 mflr r0
c00bb598: 3d 20 9e 37 lis r9,-25033
c00bb59c: bf 01 00 10 stmw r24,16(r1)
c00bb5a0: 61 29 00 01 ori r9,r9,1
c00bb5a4: 3d 60 c0 38 lis r11,-16328
c00bb5a8: 90 01 00 34 stw r0,52(r1)
c00bb5ac: 7c 60 4a 78 xor r0,r3,r9
c00bb5b0: 54 00 d9 7e rlwinm r0,r0,27,5,31
c00bb5b4: 83 84 00 00 lwz r28,0(r4)
c00bb5b8: 7c 3f 0b 78 mr r31,r1
c00bb5bc: 81 0b 1a 2c lwz r8,6700(r11)
c00bb5c0: 39 6b 1a 2c addi r11,r11,6700
c00bb5c4: 7c 00 e2 14 add r0,r0,r28
c00bb5c8: 81 4b 00 04 lwz r10,4(r11)
c00bb5cc: 7c 09 4a 78 xor r9,r0,r9
c00bb5d0: 83 24 00 04 lwz r25,4(r4)
c00bb5d4: 7d 29 44 30 srw r9,r9,r8
c00bb5d8: 81 0b 00 08 lwz r8,8(r11)
c00bb5dc: 7d 29 02 78 xor r9,r9,r0
c00bb5e0: 83 04 00 08 lwz r24,8(r4)
c00bb5e4: 7d 29 50 38 and r9,r9,r10
c00bb5e8: 55 29 10 3a rlwinm r9,r9,2,0,29
c00bb5ec: 7c 09 40 2e lwzx r0,r9,r8
c00bb5f0: 7c 9a 23 78 mr r26,r4
c00bb5f4: 7c 7b 1b 78 mr r27,r3
c00bb5f8: 2f 80 00 00 cmpwi cr7,r0,0
c00bb5fc: 7c 1e 03 78 mr r30,r0
c00bb600: 40 be 00 14 bne+ cr7,c00bb614 <__d_lookup+0x90>
c00bb604: 48 00 00 7c b c00bb680 <__d_lookup+0xfc>
c00bb608: 83 de 00 00 lwz r30,0(r30)
c00bb60c: 2f 9e 00 00 cmpwi cr7,r30,0
c00bb610: 41 9e 00 70 beq- cr7,c00bb680 <__d_lookup+0xfc>
c00bb614: 80 1e 00 00 lwz r0,0(r30)
c00bb618: 2f 80 00 00 cmpwi cr7,r0,0
c00bb61c: 41 9e 00 08 beq- cr7,c00bb624 <__d_lookup+0xa0>
c00bb620: 7c 00 02 2c dcbt r0,r0
c00bb624: 3b be ff f4 addi r29,r30,-12
c00bb628: 80 1d 00 18 lwz r0,24(r29)
c00bb62c: 7f 80 e0 00 cmpw cr7,r0,r28
c00bb630: 40 9e ff d8 bne+ cr7,c00bb608 <__d_lookup+0x84>
c00bb634: 80 1d 00 14 lwz r0,20(r29)
c00bb638: 7f 80 d8 00 cmpw cr7,r0,r27
c00bb63c: 40 9e ff cc bne+ cr7,c00bb608 <__d_lookup+0x84>
c00bb640: 81 3b 00 48 lwz r9,72(r27)
c00bb644: 38 9d 00 18 addi r4,r29,24
c00bb648: 2f 89 00 00 cmpwi cr7,r9,0
c00bb64c: 41 9e 00 50 beq- cr7,c00bb69c <__d_lookup+0x118>
c00bb650: 80 09 00 08 lwz r0,8(r9)
c00bb654: 2f 80 00 00 cmpwi cr7,r0,0
c00bb658: 41 9e 00 44 beq- cr7,c00bb69c <__d_lookup+0x118>
c00bb65c: 7f 63 db 78 mr r3,r27
c00bb660: 7c 09 03 a6 mtctr r0
c00bb664: 7f 45 d3 78 mr r5,r26
c00bb668: 4e 80 04 21 bctrl
c00bb66c: 2f 83 00 00 cmpwi cr7,r3,0
c00bb670: 41 9e 00 50 beq- cr7,c00bb6c0 <__d_lookup+0x13c>
c00bb674: 83 de 00 00 lwz r30,0(r30)
c00bb678: 2f 9e 00 00 cmpwi cr7,r30,0
c00bb67c: 40 9e ff 98 bne+ cr7,c00bb614 <__d_lookup+0x90>
c00bb680: 38 60 00 00 li r3,0
c00bb684: 81 61 00 00 lwz r11,0(r1)
c00bb688: 80 0b 00 04 lwz r0,4(r11)
c00bb68c: 7d 61 5b 78 mr r1,r11
c00bb690: bb 0b ff e0 lmw r24,-32(r11)
c00bb694: 7c 08 03 a6 mtlr r0
c00bb698: 4e 80 00 20 blr
c00bb69c: 80 04 00 04 lwz r0,4(r4)
c00bb6a0: 7f 80 c8 00 cmpw cr7,r0,r25
c00bb6a4: 40 9e ff 64 bne+ cr7,c00bb608 <__d_lookup+0x84>
c00bb6a8: 80 64 00 08 lwz r3,8(r4)
c00bb6ac: 7f 25 cb 78 mr r5,r25
c00bb6b0: 7f 04 c3 78 mr r4,r24
c00bb6b4: 4b f5 ab 65 bl c0016218 <memcmp>
c00bb6b8: 2f 83 00 00 cmpwi cr7,r3,0
c00bb6bc: 40 9e ff 4c bne+ cr7,c00bb608 <__d_lookup+0x84>
c00bb6c0: 80 1d 00 04 lwz r0,4(r29)
c00bb6c4: 70 09 00 10 andi. r9,r0,16
c00bb6c8: 40 a2 ff b8 bne- c00bb680 <__d_lookup+0xfc>
c00bb6cc: 7c 00 e8 28 lwarx r0,0,r29
c00bb6d0: 30 00 00 01 addic r0,r0,1
c00bb6d4: 7c 00 e9 2d stwcx. r0,0,r29
c00bb6d8: 40 a2 ff f4 bne- c00bb6cc <__d_lookup+0x148>
c00bb6dc: 7f a3 eb 78 mr r3,r29
c00bb6e0: 4b ff ff a4 b c00bb684 <__d_lookup+0x100>
c00bb6e4 <d_lookup>:
c00bb6e4: 7c 08 02 a6 mflr r0
c00bb6e8: 90 01 00 04 stw r0,4(r1)
On Wed, 20 Aug 2008, Eran Liberty wrote:
> Steven Rostedt wrote:
> > On Wed, 20 Aug 2008, Steven Rostedt wrote:
> >
> >
> > > On Wed, 20 Aug 2008, Benjamin Herrenschmidt wrote:
> > >
> > >
> > > > Found the problem (or at least -a- problem), it's a gcc bug.
> > > >
> > > > Well, first I must say the code generated by -pg is just plain
> > > > horrible :-)
> > > >
> > > > Appart from that, look at the exit of, for example, __d_lookup, as
> > > > generated by gcc when ftrace is enabled:
> > > >
> > > > c00c0498: 38 60 00 00 li r3,0
> > > > c00c049c: 81 61 00 00 lwz r11,0(r1)
> > > > c00c04a0: 80 0b 00 04 lwz r0,4(r11)
> > > > c00c04a4: 7d 61 5b 78 mr r1,r11
> > > > c00c04a8: bb 0b ff e0 lmw r24,-32(r11)
> > > > c00c04ac: 7c 08 03 a6 mtlr r0
> > > > c00c04b0: 4e 80 00 20 blr
> > > >
> > > > As you can see, it restores r1 -before- it pops r24..r31 off
> > > > the stack ! I let you imagine what happens if an interrupt happens
> > > > just in between those two instructions (mr and lmw). We don't do
> > > > redzones on our ABI, so basically, the registers end up corrupted
> > > > by the interrupt.
> > > >
> > > Ouch! You've disassembled this without -pg too, and it does not have this
> > > bug? What version of gcc do you have?
> > >
> > >
> >
> > I have:
> > gcc (Debian 4.3.1-2) 4.3.1
> >
> > c00c64c8: 81 61 00 00 lwz r11,0(r1)
> > c00c64cc: 7f 83 e3 78 mr r3,r28
> > c00c64d0: 80 0b 00 04 lwz r0,4(r11)
> > c00c64d4: ba eb ff dc lmw r23,-36(r11)
> > c00c64d8: 7d 61 5b 78 mr r1,r11
> > c00c64dc: 7c 08 03 a6 mtlr r0
> > c00c64e0: 4e 80 00 20 blr
> >
> >
> > My version looks fine. I'm thinking that this is a separate issue than what
> > Eran is seeing.
> >
> > Eran, can you do an "objdump -dr vmlinux" and search for __d_lookup, and
> > print out the end of the function dump.
> >
> > Thanks,
> >
> > -- Steve
> >
> >
> >
> >
> powerpc-linux-gnu-objdump -dr --start-address=0xc00bb584 vmlinux | head -n 100
>
> vmlinux: file format elf32-powerpc
>
> Disassembly of section .text:
>
> c00bb584 <__d_lookup>:
[...]
> c00bb670: 41 9e 00 50 beq- cr7,c00bb6c0 <__d_lookup+0x13c>
> c00bb674: 83 de 00 00 lwz r30,0(r30)
> c00bb678: 2f 9e 00 00 cmpwi cr7,r30,0
> c00bb67c: 40 9e ff 98 bne+ cr7,c00bb614 <__d_lookup+0x90>
> c00bb680: 38 60 00 00 li r3,0
> c00bb684: 81 61 00 00 lwz r11,0(r1)
> c00bb688: 80 0b 00 04 lwz r0,4(r11)
> c00bb68c: 7d 61 5b 78 mr r1,r11
[ BUG HERE IF INTERRUPT HAPPENS ]
> c00bb690: bb 0b ff e0 lmw r24,-32(r11)
> c00bb694: 7c 08 03 a6 mtlr r0
> c00bb698: 4e 80 00 20 blr
Yep, you have the same bug in your compiler.
-- Steve
Steven Rostedt wrote:
> On Wed, 20 Aug 2008, Eran Liberty wrote:
>
>
>> Steven Rostedt wrote:
>>
>>> On Wed, 20 Aug 2008, Steven Rostedt wrote:
>>>
>>>
>>>
>>>> On Wed, 20 Aug 2008, Benjamin Herrenschmidt wrote:
>>>>
>>>>
>>>>
>>>>> Found the problem (or at least -a- problem), it's a gcc bug.
>>>>>
>>>>> Well, first I must say the code generated by -pg is just plain
>>>>> horrible :-)
>>>>>
>>>>> Appart from that, look at the exit of, for example, __d_lookup, as
>>>>> generated by gcc when ftrace is enabled:
>>>>>
>>>>> c00c0498: 38 60 00 00 li r3,0
>>>>> c00c049c: 81 61 00 00 lwz r11,0(r1)
>>>>> c00c04a0: 80 0b 00 04 lwz r0,4(r11)
>>>>> c00c04a4: 7d 61 5b 78 mr r1,r11
>>>>> c00c04a8: bb 0b ff e0 lmw r24,-32(r11)
>>>>> c00c04ac: 7c 08 03 a6 mtlr r0
>>>>> c00c04b0: 4e 80 00 20 blr
>>>>>
>>>>> As you can see, it restores r1 -before- it pops r24..r31 off
>>>>> the stack ! I let you imagine what happens if an interrupt happens
>>>>> just in between those two instructions (mr and lmw). We don't do
>>>>> redzones on our ABI, so basically, the registers end up corrupted
>>>>> by the interrupt.
>>>>>
>>>>>
>>>> Ouch! You've disassembled this without -pg too, and it does not have this
>>>> bug? What version of gcc do you have?
>>>>
>>>>
>>>>
>>> I have:
>>> gcc (Debian 4.3.1-2) 4.3.1
>>>
>>> c00c64c8: 81 61 00 00 lwz r11,0(r1)
>>> c00c64cc: 7f 83 e3 78 mr r3,r28
>>> c00c64d0: 80 0b 00 04 lwz r0,4(r11)
>>> c00c64d4: ba eb ff dc lmw r23,-36(r11)
>>> c00c64d8: 7d 61 5b 78 mr r1,r11
>>> c00c64dc: 7c 08 03 a6 mtlr r0
>>> c00c64e0: 4e 80 00 20 blr
>>>
>>>
>>> My version looks fine. I'm thinking that this is a separate issue than what
>>> Eran is seeing.
>>>
>>> Eran, can you do an "objdump -dr vmlinux" and search for __d_lookup, and
>>> print out the end of the function dump.
>>>
>>> Thanks,
>>>
>>> -- Steve
>>>
>>>
>>>
>>>
>>>
>> powerpc-linux-gnu-objdump -dr --start-address=0xc00bb584 vmlinux | head -n 100
>>
>> vmlinux: file format elf32-powerpc
>>
>> Disassembly of section .text:
>>
>> c00bb584 <__d_lookup>:
>>
>
> [...]
>
>
>> c00bb670: 41 9e 00 50 beq- cr7,c00bb6c0 <__d_lookup+0x13c>
>> c00bb674: 83 de 00 00 lwz r30,0(r30)
>> c00bb678: 2f 9e 00 00 cmpwi cr7,r30,0
>> c00bb67c: 40 9e ff 98 bne+ cr7,c00bb614 <__d_lookup+0x90>
>> c00bb680: 38 60 00 00 li r3,0
>> c00bb684: 81 61 00 00 lwz r11,0(r1)
>> c00bb688: 80 0b 00 04 lwz r0,4(r11)
>> c00bb68c: 7d 61 5b 78 mr r1,r11
>>
>
> [ BUG HERE IF INTERRUPT HAPPENS ]
>
>
>> c00bb690: bb 0b ff e0 lmw r24,-32(r11)
>> c00bb694: 7c 08 03 a6 mtlr r0
>> c00bb698: 4e 80 00 20 blr
>>
>
> Yep, you have the same bug in your compiler.
>
> -- Steve
>
Hmm... so whats now?
Is there a way to prove this scenario is indeed the one that caused the
opps?
-- Liberty
On Wed, 2008-08-20 at 09:14 -0400, Steven Rostedt wrote:
> On Wed, 20 Aug 2008, Benjamin Herrenschmidt wrote:
>
> > Found the problem (or at least -a- problem), it's a gcc bug.
> >
> > Well, first I must say the code generated by -pg is just plain
> > horrible :-)
> >
> > Appart from that, look at the exit of, for example, __d_lookup, as
> > generated by gcc when ftrace is enabled:
> >
> > c00c0498: 38 60 00 00 li r3,0
> > c00c049c: 81 61 00 00 lwz r11,0(r1)
> > c00c04a0: 80 0b 00 04 lwz r0,4(r11)
> > c00c04a4: 7d 61 5b 78 mr r1,r11
> > c00c04a8: bb 0b ff e0 lmw r24,-32(r11)
> > c00c04ac: 7c 08 03 a6 mtlr r0
> > c00c04b0: 4e 80 00 20 blr
> >
> > As you can see, it restores r1 -before- it pops r24..r31 off
> > the stack ! I let you imagine what happens if an interrupt happens
> > just in between those two instructions (mr and lmw). We don't do
> > redzones on our ABI, so basically, the registers end up corrupted
> > by the interrupt.
>
> Ouch! You've disassembled this without -pg too, and it does not have this
> bug? What version of gcc do you have?
Segher was looking at this a bit this morning. He thinks it's really
-fno-omit-frame-pointer that is causing this. That really shouldn't
even be set on PowerPC, but FTRACE uses select which overrides the
depends on stuff in Kconfig.
Segher can probably tell you more once his email is back up.
josh
On Wed, 20 Aug 2008, Josh Boyer wrote:
>
> Segher was looking at this a bit this morning. He thinks it's really
> -fno-omit-frame-pointer that is causing this. That really shouldn't
> even be set on PowerPC, but FTRACE uses select which overrides the
> depends on stuff in Kconfig.
I can easily make a patch that makes that select an x86 only.
-- Steve
On Wed, 20 Aug 2008 10:22:49 -0400 (EDT)
Steven Rostedt <[email protected]> wrote:
>
> On Wed, 20 Aug 2008, Josh Boyer wrote:
> >
> > Segher was looking at this a bit this morning. He thinks it's really
> > -fno-omit-frame-pointer that is causing this. That really shouldn't
> > even be set on PowerPC, but FTRACE uses select which overrides the
> > depends on stuff in Kconfig.
>
> I can easily make a patch that makes that select an x86 only.
That's probably a first step, but you might want to wait until Segher
can fill in more details. I'm sort of just relaying what he and I were
talking about on IRC. IIRC, he was testing builds with and without
both -pg and -fno-omit-frame-pointer and found the bug only when
-fno-omit-frame-pointer was present.
josh
On 8/20/08, Eran Liberty <[email protected]> wrote:
> Steven Rostedt wrote:
>
> > On Wed, 20 Aug 2008, Eran Liberty wrote:
> >
> >
> >
> > > Steven Rostedt wrote:
> > >
> > >
> > > > On Wed, 20 Aug 2008, Steven Rostedt wrote:
> > > >
> > > >
> > > >
> > > > > On Wed, 20 Aug 2008, Benjamin Herrenschmidt wrote:
> > > > >
> > > > >
> > > > >
> > > > > > Found the problem (or at least -a- problem), it's a gcc bug.
> > > > > >
> > > > > > Well, first I must say the code generated by -pg is just plain
> > > > > > horrible :-)
> > > > > >
> > > > > > Appart from that, look at the exit of, for example, __d_lookup, as
> > > > > > generated by gcc when ftrace is enabled:
> > > > > >
> > > > > > c00c0498: 38 60 00 00 li r3,0
> > > > > > c00c049c: 81 61 00 00 lwz r11,0(r1)
> > > > > > c00c04a0: 80 0b 00 04 lwz r0,4(r11)
> > > > > > c00c04a4: 7d 61 5b 78 mr r1,r11
> > > > > > c00c04a8: bb 0b ff e0 lmw r24,-32(r11)
> > > > > > c00c04ac: 7c 08 03 a6 mtlr r0
> > > > > > c00c04b0: 4e 80 00 20 blr
> > > > > >
> > > > > > As you can see, it restores r1 -before- it pops r24..r31 off
> > > > > > the stack ! I let you imagine what happens if an interrupt happens
> > > > > > just in between those two instructions (mr and lmw). We don't do
> > > > > > redzones on our ABI, so basically, the registers end up corrupted
> > > > > > by the interrupt.
> > > > > >
> > > > > >
> > > > > Ouch! You've disassembled this without -pg too, and it does not
> have this
> > > > > bug? What version of gcc do you have?
> > > > >
> > > > >
> > > > >
> > > > I have:
> > > > gcc (Debian 4.3.1-2) 4.3.1
> > > >
> > > > c00c64c8: 81 61 00 00 lwz r11,0(r1)
> > > > c00c64cc: 7f 83 e3 78 mr r3,r28
> > > > c00c64d0: 80 0b 00 04 lwz r0,4(r11)
> > > > c00c64d4: ba eb ff dc lmw r23,-36(r11)
> > > > c00c64d8: 7d 61 5b 78 mr r1,r11
> > > > c00c64dc: 7c 08 03 a6 mtlr r0
> > > > c00c64e0: 4e 80 00 20 blr
> > > >
> > > >
> > > > My version looks fine. I'm thinking that this is a separate issue
> than what
> > > > Eran is seeing.
> > > >
> > > > Eran, can you do an "objdump -dr vmlinux" and search for __d_lookup,
> and
> > > > print out the end of the function dump.
> > > >
> > > > Thanks,
> > > >
> > > > -- Steve
> > > >
> > > >
> > > >
> > > >
> > > >
> > > powerpc-linux-gnu-objdump -dr --start-address=0xc00bb584 vmlinux | head
> -n 100
> > >
> > > vmlinux: file format elf32-powerpc
> > >
> > > Disassembly of section .text:
> > >
> > > c00bb584 <__d_lookup>:
> > >
> > >
> >
> > [...]
> >
> >
> >
> > > c00bb670: 41 9e 00 50 beq- cr7,c00bb6c0 <__d_lookup+0x13c>
> > > c00bb674: 83 de 00 00 lwz r30,0(r30)
> > > c00bb678: 2f 9e 00 00 cmpwi cr7,r30,0
> > > c00bb67c: 40 9e ff 98 bne+ cr7,c00bb614 <__d_lookup+0x90>
> > > c00bb680: 38 60 00 00 li r3,0
> > > c00bb684: 81 61 00 00 lwz r11,0(r1)
> > > c00bb688: 80 0b 00 04 lwz r0,4(r11)
> > > c00bb68c: 7d 61 5b 78 mr r1,r11
> > >
> > >
> >
> > [ BUG HERE IF INTERRUPT HAPPENS ]
> >
> >
> >
> > > c00bb690: bb 0b ff e0 lmw r24,-32(r11)
> > > c00bb694: 7c 08 03 a6 mtlr r0
> > > c00bb698: 4e 80 00 20 blr
> > >
> > >
> >
> > Yep, you have the same bug in your compiler.
> >
> > -- Steve
> >
> >
> Hmm... so whats now?
>
> Is there a way to prove this scenario is indeed the one that caused the
> opps?
Manually edit the broken binary to change the order of the restore and
see if the problem disappears. That will keep everything else
constant.
>
> -- Liberty
>
> _______________________________________________
> Linuxppc-dev mailing list
> [email protected]
> https://ozlabs.org/mailman/listinfo/linuxppc-dev
>
--
Jon Smirl
[email protected]
On Wed, 20 Aug 2008, Jon Smirl wrote:
> > >
> > > > c00bb690: bb 0b ff e0 lmw r24,-32(r11)
> > > > c00bb694: 7c 08 03 a6 mtlr r0
> > > > c00bb698: 4e 80 00 20 blr
> > > >
> > > >
> > >
> > > Yep, you have the same bug in your compiler.
> > >
> > > -- Steve
> > >
> > >
> > Hmm... so whats now?
> >
> > Is there a way to prove this scenario is indeed the one that caused the
> > opps?
>
> Manually edit the broken binary to change the order of the restore and
> see if the problem disappears. That will keep everything else
> constant.
Does it only happen on this function? Or is it happening on others as
well. So manually fixing it here won't help much unless we fix it
everywhere.
-- Steve
On Wed, 20 Aug 2008, Eran Liberty wrote:
> >
> >
> > > c00bb670: 41 9e 00 50 beq- cr7,c00bb6c0 <__d_lookup+0x13c>
> > > c00bb674: 83 de 00 00 lwz r30,0(r30)
> > > c00bb678: 2f 9e 00 00 cmpwi cr7,r30,0
> > > c00bb67c: 40 9e ff 98 bne+ cr7,c00bb614 <__d_lookup+0x90>
> > > c00bb680: 38 60 00 00 li r3,0
> > > c00bb684: 81 61 00 00 lwz r11,0(r1)
> > > c00bb688: 80 0b 00 04 lwz r0,4(r11)
> > > c00bb68c: 7d 61 5b 78 mr r1,r11
> > >
> >
> > [ BUG HERE IF INTERRUPT HAPPENS ]
> >
> >
> > > c00bb690: bb 0b ff e0 lmw r24,-32(r11)
> > > c00bb694: 7c 08 03 a6 mtlr r0
> > > c00bb698: 4e 80 00 20 blr
> > >
> >
> > Yep, you have the same bug in your compiler.
> >
> > -- Steve
> >
> Hmm... so whats now?
Try anothe gcc version ;-)
>
> Is there a way to prove this scenario is indeed the one that caused the opps?
I'm not convinced that this is the bug you were seeing. Do you still get
the crash if you disable dynamic ftrace but leave ftrace on?
-- Steve
Jon Smirl wrote:
> Manually edit the broken binary to change the order of the restore and
> see if the problem disappears. That will keep everything else
> constant.
>
>
checked with objdump where __d_lookup()
> powerpc-linux-gnu-objdump -dr --start-address=0xc0065790 vmlinux |
grep "<__d_lookup>:"
c00b91d8 <__d_lookup>:
> liberty@liberty:~/svn/exsw1600-2.6.27-rc2$ powerpc-linux-gnu-objdump
-dr --start-address=0xc00b91d8 vmlinux| head -n 100
vmlinux: file format elf32-powerpc
Disassembly of section .text:
c00b91d8 <__d_lookup>:
c00b91d8: 7c 08 02 a6 mflr r0
c00b91dc: 90 01 00 04 stw r0,4(r1)
....
c00b92bc: 4e 80 04 21 bctrl
c00b92c0: 2f 83 00 00 cmpwi cr7,r3,0
c00b92c4: 41 9e 00 50 beq- cr7,c00b9314 <__d_lookup+0x13c>
c00b92c8: 83 de 00 00 lwz r30,0(r30)
c00b92cc: 2f 9e 00 00 cmpwi cr7,r30,0
c00b92d0: 40 9e ff 98 bne+ cr7,c00b9268 <__d_lookup+0x90>
c00b92d4: 38 60 00 00 li r3,0
c00b92d8: 81 61 00 00 lwz r11,0(r1)
c00b92dc: 80 0b 00 04 lwz r0,4(r11)
c00b92e0: 7d 61 5b 78 mr r1,r11
<=== As explained by Steve, these two should be replaced ===>
c00b92e4: bb 0b ff e0 lmw r24,-32(r11)
c00b92e8: 7c 08 03 a6 mtlr r0
c00b92ec: 4e 80 00 20 blr
c00b92f0: 80 04 00 04 lwz r0,4(r4)
....
c00b9330: 7f a3 eb 78 mr r3,r29
c00b9334: 4b ff ff a4 b c00b92d8 <__d_lookup+0x100>
c00b9338 <d_lookup>:
c00b9338: 7c 08 02 a6 mflr r0
on the target I fired up xmon and replaced them.
~ # echo x > /proc/sysrq-trigger
SysRq : Entering xmon
Vector: 0 at [df51fdb8]
pc: c0025960: sysrq_handle_xmon+0x60/0x64
lr: c0025960: sysrq_handle_xmon+0x60/0x64
sp: df51fe80
msr: 21000
current = 0xdc22a9a0
pid = 1698, comm = echo
WARNING: exception is not recoverable, can't continue
enter ? for help
[df51fe90] c0193c38 __handle_sysrq+0xa8/0x178
[df51fec0] c00ee818 write_sysrq_trigger+0x78/0x7c
[df51fed0] c00e65e4 proc_reg_write+0x5c/0x84
[df51fef0] c00a299c vfs_write+0xc8/0x180
[df51ff10] c00a2f40 sys_write+0x5c/0xa4
[df51ff40] c0010554 ret_from_syscall+0x0/0x3c
SP (bffe87e0) is in userspace
mon> di c00b92d0
c00b92d0 409eff98 bne cr7,c00b9268 # __d_lookup+0x90/0x160
c00b92d4 38600000 li r3,0
c00b92d8 81610000 lwz r11,0(r1)
c00b92dc 800b0004 lwz r0,4(r11)
c00b92e0 7d615b78 mr r1,r11
<=== wrong order ===>
c00b92e4 bb0bffe0 lmw r24,-32(r11)
c00b92e8 7c0803a6 mtlr r0
c00b92ec 4e800020 blr
c00b92f0 80040004 lwz r0,4(r4)
c00b92f4 7f80c800 cmpw cr7,r0,r25
c00b92f8 409eff64 bne cr7,c00b925c # __d_lookup+0x84/0x160
c00b92fc 80640008 lwz r3,8(r4)
c00b9300 7f25cb78 mr r5,r25
c00b9304 7f04c378 mr r4,r24
c00b9308 4bf5ccf1 bl c0015ff8 # memcmp+0x0/0x30
c00b930c 2f830000 cmpwi cr7,r3,0
mon> m c00b92e0
c00b92e0 7d l
c00b92e0 7d615b78 bb0bffe0
c00b92e4 bb0bffe0 7d615b78
c00b92e8 7c0803a6 x
mon> di c00b92d0
c00b92d0 409eff98 bne cr7,c00b9268 # __d_lookup+0x90/0x160
c00b92d4 38600000 li r3,0
c00b92d8 81610000 lwz r11,0(r1)
c00b92dc 800b0004 lwz r0,4(r11)
c00b92e0 bb0bffe0 lmw r24,-32(r11)
<=== right order ===>
c00b92e4 7d615b78 mr r1,r11
c00b92e8 7c0803a6 mtlr r0
c00b92ec 4e800020 blr
c00b92f0 80040004 lwz r0,4(r4)
c00b92f4 7f80c800 cmpw cr7,r0,r25
c00b92f8 409eff64 bne cr7,c00b925c # __d_lookup+0x84/0x160
c00b92fc 80640008 lwz r3,8(r4)
c00b9300 7f25cb78 mr r5,r25
c00b9304 7f04c378 mr r4,r24
c00b9308 4bf5ccf1 bl c0015ff8 # memcmp+0x0/0x30
c00b930c 2f830000 cmpwi cr7,r3,0
mon> X
~ #
And took it for a test drive
~# while [ 1 ] ; do find / > /dev/null ; echo . ; done
.
.
.
[ 10 min later ]
.
.
.
Stable! Yeepee.
So how do we get this to be right in the first place and everywhere?
-- Liberty
On Wed, 20 Aug 2008, Eran Liberty wrote:
> mon> X
> ~ #
>
> And took it for a test drive
> ~# while [ 1 ] ; do find / > /dev/null ; echo . ; done
> .
> .
> .
> [ 10 min later ]
> .
> .
> .
>
> Stable! Yeepee.
Eran, thanks a lot for testing this!
>
> So how do we get this to be right in the first place and everywhere?
>
This is a bug in gcc, and should be mentioned to the gcc maintainers.
-- Steve
> Hmm... so whats now?
>
> Is there a way to prove this scenario is indeed the one that caused the
> opps?
I've verified in sim that the crash in do_lookup() I was seeing (similar
backtrace) was indeed caused by an interrupt coming between those two
instructions and clobbering the saved GPRs.
So I think it's very likely our problem.
Ben.
On Thu, 21 Aug 2008, Benjamin Herrenschmidt wrote:
>
> > > As you can see, it restores r1 -before- it pops r24..r31 off
> > > the stack ! I let you imagine what happens if an interrupt happens
> > > just in between those two instructions (mr and lmw). We don't do
> > > redzones on our ABI, so basically, the registers end up corrupted
> > > by the interrupt.
> >
> > Ouch! You've disassembled this without -pg too, and it does not have this
> > bug? What version of gcc do you have?
>
> Oops, should have mentioned it ;-)
>
> 4.2.3 (Ubuntu 4.2.3-2ubuntu7)
I'm running a 4.3.1 (Debian 4.3.1-2) on my powerbook. Perhaps the issue is
already taken care of.
-- Steve
> > As you can see, it restores r1 -before- it pops r24..r31 off
> > the stack ! I let you imagine what happens if an interrupt happens
> > just in between those two instructions (mr and lmw). We don't do
> > redzones on our ABI, so basically, the registers end up corrupted
> > by the interrupt.
>
> Ouch! You've disassembled this without -pg too, and it does not have this
> bug? What version of gcc do you have?
Oops, should have mentioned it ;-)
4.2.3 (Ubuntu 4.2.3-2ubuntu7)
Ben.