On Sat 29-09-12 21:07:27, Alexander Holler wrote:
> On 27.09.2012 22:03, Jan Kara wrote:
> >On Thu 27-09-12 17:46:48, Alexander Holler wrote:
> >>Hello,
> >>
> >>On 27.09.2012 17:12, Jan Kara wrote:
> >>> Just some thoughts about your oops:
> >>>The assertion which fails is:
> >>>BUG_ON(!list_empty(&bh->b_assoc_buffers));
> >>>
> >>>Now b_assoc_buffers isn't used very much. In particular ext4 which you seem
> >>>to be using doesn't use this list at all (except when mounted in nojournal
> >>>mode but that doesn't seem to be your case). That would point rather
> >>>strongly at a memory corruption issue.
> >>>
> >>>So if you can reproduce the oops, it might be interesting to print
> >>>bh->b_assoc_buffers.next and &bh->b_assoc_buffers.next if the list is found
> >>>to be non-empty.
> >>
> >>Hmm, a stray pointer would explain it all too. Especially the cases
> >>where I just saw wrong content in the archive without getting
> >>any oops. I will try to reproduce it with
> >>
> >>pr_info("AHO: %p %p\n", bh->b_assoc_buffers.next,
> >>&bh->b_assoc_buffers.next);
> >>after the BUG_ON().
> > It should have been:
> >	if (!list_empty(&bh->b_assoc_buffers))
> >		pr_info("AHO: %p %p\n", bh->b_assoc_buffers.next,
> >			&bh->b_assoc_buffers.next);
> > *before* BUG_ON().
> >
> > What you saw in the logs were just pointers showing the list is empty
> >(naturally as otherwise we'd see the BUG_ON trigger).
>
> Yes, I had already wondered what you wanted to read from the output. ;)
>
> Btw. I just hit that bug while doing sha1sum /dev/sr0, where sr0
> is a DVD writer attached to a SATA port. No USB involved. Before the
> sha1sum I did an mbuffer < /dev/sr0 | bzip2smp >foo.iso.bz2. But
> that needed only a few minutes (8GB) and I haven't had any throttle
> events or similar, so I don't think the CPU (or whatever) got hot.
>
> ---------
> Sep 29 20:38:20 krabat kernel: [ 1652.879952] ------------[ cut here ]------------
> Sep 29 20:38:20 krabat kernel: [ 1652.879956] kernel BUG at fs/buffer.c:3199!
> Sep 29 20:38:20 krabat kernel: [ 1652.879957] invalid opcode: 0000 [#1] SMP
> Sep 29 20:38:20 krabat kernel: [ 1652.879959] CPU 2
> Sep 29 20:38:20 krabat kernel: [ 1652.879960] Modules linked in: nfs
> rfcomm fuse hidp ebtable_nat ebtables ipt_MASQUERADE xt_CHECKSUM
> iptable_mangle iptable_nat nf_nat bridge stp llc it87 hwmon_vid
> ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
> ip6_tables xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
> xt_state nf_conntrack iptable_filter btusb bluetooth rfkill joydev
> hid_logitech ff_memless usbhid pata_jmicron binfmt_misc usb_storage
> uas virtio_blk virtio_net virtio_balloon virtio_pci virtio_ring
> virtio vhost_net tun macvtap macvlan snd_hda_codec_hdmi
> snd_hda_codec_realtek coretemp kvm_intel snd_hda_intel snd_hda_codec
> kvm snd_hwdep uhci_hcd uinput snd_seq crc32c_intel snd_seq_device
> sr_mod snd_pcm xhci_hcd cdrom i7core_edac microcode ehci_hcd
> snd_page_alloc dm_mod edac_core fglrx(PO) r8169 snd_timer lpc_ich
> mii snd jmicron mfd_core soundcore agpgart usbcore usb_common nfsd
> nfs_acl auth_rpcgss lockd sunrpc ipv6 [last unloaded:
> scsi_wait_scan]
> Sep 29 20:38:20 krabat kernel: [ 1652.879992]
> Sep 29 20:38:20 krabat kernel: [ 1652.879993] Pid: 4670, comm: sha1sum Tainted: P O 3.5.4-00009-gfa43f23-dirty #228
BTW, the fglrx module taints the kernel because it is a proprietary driver.
Can you reproduce the issue without this module loaded?
Honza
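
For readers following the diagnostic discussed earlier in this message, here is a minimal user-space sketch of why printing b_assoc_buffers.next next to &b_assoc_buffers.next shows whether the list is empty. It assumes a simplified stand-in for the kernel's struct list_head; the helper names init_list_head, list_is_empty and list_add_after are made up for this illustration and are not the kernel API. Because next is the first member of the structure, the address of the next field equals the address of the list head, so the two printed pointers are equal exactly when the list is empty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified user-space stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next;
	struct list_head *prev;
};

/* An empty list is a head whose next and prev point back at itself. */
static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Same idea as the kernel's list_empty(): the head points at itself. */
static bool list_is_empty(const struct list_head *head)
{
	assert(head != NULL);
	return head->next == head;
}

/* Link a node in directly after the head (illustrative helper). */
static void list_add_after(struct list_head *node, struct list_head *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}
```

On a freshly initialised head, list_is_empty() returns true and head.next equals the head's own address. A stray write that changes head.next makes the two values differ, which is the condition the BUG_ON() in the oops checks for.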
Hello,
On 01.10.2012 11:10, Jan Kara wrote:
>> sha1sum Tainted: P O 3.5.4-00009-gfa43f23-dirty #228
> BTW, the fglrx module taints the kernel because it is a proprietary driver.
I know.
> Can you reproduce the issue without this module loaded?
I will try it with a clean 3.6. Most of the 9 additional patches here
are for ARM boxes, but anyway. I will need a few days.
Regards,
Alexander
On 01.10.2012 11:21, Alexander Holler wrote:
> Hello,
>
> On 01.10.2012 11:10, Jan Kara wrote:
>
>>> sha1sum Tainted: P O 3.5.4-00009-gfa43f23-dirty #228
>> BTW, the fglrx module taints the kernel because it is a proprietary
>> driver.
>
> I know.
>
>> Can you reproduce the issue without this module loaded?
>
> I will try it with a clean 3.6. Most of the 9 additional patches here
> are for ARM boxes, but anyway. I will need a few days.
Just tried my "tar cp . | mbuffer | bzip2smp >/usb3/ext4/foo.tar.bz2"
using a 3.6 kernel without fglrx and without any additional
patches. The first try already ended up in a broken archive (tar djf =>
bzip2: Data integrity error when decompressing), but (again) without the
BUG() in fs/buffer.c getting hit. I will do some more tests, trying to
hit that BUG().
Regards,
Alexander
On 02.10.2012 11:30, Alexander Holler wrote:
> On 01.10.2012 11:21, Alexander Holler wrote:
>> Hello,
>>
>> On 01.10.2012 11:10, Jan Kara wrote:
>>
>>>> sha1sum Tainted: P O 3.5.4-00009-gfa43f23-dirty #228
>>> BTW, the fglrx module taints the kernel because it is a proprietary
>>> driver.
>>
>> I know.
>>
>>> Can you reproduce the issue without this module loaded?
>>
>> I will try it with a clean 3.6. Most of the 9 additional patches here
>> are for ARM boxes, but anyway. I will need a few days.
>
> Just tried my "tar cp . | mbuffer | bzip2smp >/usb3/ext4/foo.tar.bz2"
> using a 3.6 kernel without fglrx and without any additional
> patches. The first try already ended up in a broken archive (tar djf =>
> bzip2: Data integrity error when decompressing), but (again) without the
> BUG() in fs/buffer.c getting hit. I will do some more tests, trying to
> hit that BUG().
I found the problem. It looks like either the RAM, the CPU or something
in between is broken, because I see memory failures (1 bit flipped in
some bytes) when running memtest(86+).
The people who are responsible for disabling the (already included) ECC
functionality in chips for "consumer" HW and laptops should get hit with
Google's (now 3y old) study on that topic all day long.
Leaving customers in danger by not offering them at least the
possibility to use ECC RAM is just stupid.
Sorry to everyone whose time I've wasted.
Regards,
Alexander
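
To make the "1 bit flipped in some bytes" symptom concrete, here is a small sketch of how a readback comparison can classify mismatches. It is not how memtest86+ is implemented; the function names flipped_bits and count_single_bit_errors are hypothetical and exist only for this example. XORing the expected and observed byte leaves a 1 in every position that changed, so counting the set bits gives the number of flipped bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bit positions in which the two bytes differ. */
static unsigned flipped_bits(uint8_t expected, uint8_t observed)
{
	uint8_t diff = expected ^ observed;
	unsigned count = 0;

	while (diff) {
		diff &= (uint8_t)(diff - 1);	/* clear the lowest set bit */
		count++;
	}
	return count;
}

/*
 * Compare two buffers of equal length. Returns how many bytes differ in
 * exactly one bit and stores the total number of differing bytes in
 * *mismatched.
 */
static size_t count_single_bit_errors(const uint8_t *expected,
				      const uint8_t *observed,
				      size_t len, size_t *mismatched)
{
	size_t single = 0;

	assert(expected != NULL && observed != NULL && mismatched != NULL);
	*mismatched = 0;
	for (size_t i = 0; i < len; i++) {
		unsigned bits = flipped_bits(expected[i], observed[i]);

		if (bits > 0)
			(*mismatched)++;
		if (bits == 1)
			single++;
	}
	return single;
}
```

If nearly all mismatching bytes differ in exactly one bit, as reported above, that pattern is more typical of failing hardware than of a software bug, which tends to overwrite whole words or larger ranges.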
> The people who are responsible for disabling the (already included) ECC
> functionality in chips for "consumer" HW and laptops should get hit with
> Google's (now 3y old) study on that topic all day long.
> Leaving customers in danger by not offering them at least the
> possibility to use ECC RAM is just stupid.
For the amount of RAM in devices nowadays, yes, that is probably a good
idea. Adding to the problem, there is an active market in fake
brand-name DIMMs.
Alan
On 14.10.2012 14:27, Alan Cox wrote:
>> The people who are responsible for disabling the (already included) ECC
>> functionality in chips for "consumer" HW and laptops should get hit with
>> Google's (now 3y old) study on that topic all day long.
>> Leaving customers in danger by not offering them at least the
>> possibility to use ECC RAM is just stupid.
>
> For the amount of RAM in devices nowadays, yes, that is probably a good
> idea. Adding to the problem, there is an active market in fake
> brand-name DIMMs.
My solution is now to add memtest=1 to the boot parameters of all my own
Linux systems which don't have ECC (unfortunately currently almost all).
That doesn't cost much time at startup and will give me at least a small
chance to spot those extremely hard to find problems.
Regards,
Alexander
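
As a rough illustration of what a pattern-based memory pass does, here is a user-space sketch. It is not the kernel's memtest code: the pattern set and the names memtest_pass and memtest_run are assumptions made for this example, and a user-space program only sees its own virtual memory (with caches in the way), so it cannot replace a boot-time test. As far as I understand the kernel parameter, memtest=N runs N passes and each pass uses another pattern; the sketch mirrors only that idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative patterns; the kernel's actual set may differ. */
static const uint64_t patterns[] = {
	0x0000000000000000ULL,
	0xffffffffffffffffULL,
	0x5555555555555555ULL,
	0xaaaaaaaaaaaaaaaaULL,
};

#define NUM_PATTERNS (sizeof(patterns) / sizeof(patterns[0]))

/* Fill the region with one pattern, read it back, count bad words. */
static size_t memtest_pass(volatile uint64_t *region, size_t words,
			   uint64_t pattern)
{
	size_t bad = 0;

	assert(region != NULL);
	for (size_t i = 0; i < words; i++)
		region[i] = pattern;
	for (size_t i = 0; i < words; i++)
		if (region[i] != pattern)
			bad++;
	return bad;
}

/* Run the given number of passes, cycling through the patterns. */
static size_t memtest_run(volatile uint64_t *region, size_t words,
			  unsigned passes)
{
	size_t bad = 0;

	for (unsigned p = 0; p < passes; p++)
		bad += memtest_pass(region, words, patterns[p % NUM_PATTERNS]);
	return bad;
}
```

A single pass, as with memtest=1, only catches cells that fail for that one pattern, which matches the "small chance" described above. More passes raise the odds at the cost of boot time.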