2006-02-10 05:36:13

by Claudio Martins

Subject: OCFS2 Filesystem inconsistency across nodes


(I'm posting this to lkml since [email protected] doesn't seem to be
accepting new subscription requests)

Hi all

I'm testing OCFS2 on a 3 node cluster (2 nodes are Dual Xeons with 512MB of
RAM, the third one is a Dual AMD Athlon with 1GB of RAM) with gigabit ethernet
interconnect, using an iSCSI target box for shared storage.
I'm using kernel version 2.6.16-rc2-git3 (gcc 4.0.3 from Debian) and compiled
the iSCSI modules from the latest open-iscsi tree.
Ocfs2-tools is version 1.1.5 from the Debian distro.
I followed the procedures for cluster configuration from the "OCFS2 User's
Guide", onlined the cluster and formatted a 2.3TB shared volume with

mkfs.ocfs2 -b 4K -C 64K -N 4 -L OCFSTest1 /dev/sda

So after mounting this shared volume on all three nodes I played with it by
creating/deleting/writing to files on different nodes. And this is the part
where the fun begins:

On node1 I can do:

#mkdir dir1
#ls -l
total 4
drwxr-xr-x 2 ctpm ctpm 4096 2006-02-10 04:30 dir1

On node2 I do:

#ls -l
total 4
drwxr-xr-x 2 ctpm ctpm 4096 Feb 10 04:30 dir1

On node3 I do:
#ls -l
total 0

Whooops!... now, that directory should have appeared on all three nodes. It
doesn't; not even if I wait half an hour.
I can reproduce the above behavior by touching or writing to files instead of
directories. It seems to be random: sometimes it works and the file
appears on all three nodes, and sometimes only on 1 or 2 of them.
Node order doesn't seem to matter, so I don't think it's a problem with the
configuration on one of the nodes.
Another example is to create a file, write something to it and then try to
read it from another node (when it happens to be visible at all) with cat,
which results in

cat: file1: Input/output error

and lots of this on dmesg output:

(1925,0):ocfs2_extent_map_lookup_read:362 ERROR: status = -3
(1925,0):ocfs2_extent_map_get_blocks:818 ERROR: status = -3
(1925,0):ocfs2_get_block:166 ERROR: Error -3 from get_blocks(0xd6a7d4bc,
436976, 1, 0, NULL)
(1925,0):ocfs2_extent_map_lookup_read:362 ERROR: status = -3
(1925,0):ocfs2_extent_map_get_blocks:818 ERROR: status = -3
(1925,0):ocfs2_get_block:166 ERROR: Error -3 from get_blocks(0xd6a7d4bc,
436977, 1, 0, NULL)
(1925,0):ocfs2_extent_map_lookup_read:362 ERROR: status = -3
(1925,0):ocfs2_extent_map_get_blocks:818 ERROR: status = -3


At first I thought this might be a case of the metadata info being propagated
to the other nodes while the data was not yet flushed to disk on the node that
wrote the file. So I tested this by copying ~2GB files to try to cause some
memory pressure, but got the same kind of disappointing results.
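
(For reference, the kind of test I mean is roughly this -- the mount point is
just an example, /mnt/ocfs2 being wherever the volume happens to be mounted:)

node1# dd if=/dev/zero of=/mnt/ocfs2/bigfile bs=1M count=2048  # ~2GB write on one node
node3# ls -l /mnt/ocfs2                          # bigfile often doesn't show up here
node3# cat /mnt/ocfs2/bigfile > /dev/null        # or reading it fails with an I/O error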

Yet another interesting result comes from creating one big file. The file
shows up in the directory listing on the node that wrote it; the other nodes
don't see it. I umount the filesystem from the nodes one by one, starting with
the one that wrote the file, and then the others. I then remount the
filesystem on all of them and the file I created is gone. So not even the
flush caused by unmounting ensures that a file gets to persistent storage.

Note that I'm using regular stuff to write the files, like copy, cat, echo,
and shell redirects. No fancy mmap stuff or test programs.

I also tested OCFS2 with a smaller volume over NBD from a remote machine, with
exactly the same kind of behavior, so I don't think this is related to the
iSCSI target volume, the open-iscsi modules or the disk box (on a side
note, this box from Promise worked flawlessly with XFS and the same open-iscsi
modules on any machine).

I'd like to know if anyone on the list has had the opportunity to test
OCFS2 or has had similar problems. OTOH, if I'm wrongly assuming something
about OCFS2 that I shouldn't be, please tell me and I'll apologise for wasting
your time ;-)
I googled for this subject and found surprisingly little information about
real world OCFS2 testing. Only lots of happy guys because of it being merged
for 2.6.16 ;-)

I'm willing to run any tests or apply any patches you want. I'll be trying
to keep the machines and the disk box for as many days as possible, so please
feel free to bug me if you think these are real bugs and you want me to test
fixes before 2.6.16 comes out.
If you need kernel .config or any other info please ask.

Thanks

Best regards

Claudio Martins


2006-02-10 06:46:17

by Mark Fasheh

Subject: Re: OCFS2 Filesystem inconsistency across nodes

Hi,

On Fri, Feb 10, 2006 at 05:36:02AM +0000, Claudio Martins wrote:
>
> (I'm posting this to lkml since [email protected] doesn't seem
> to be accepting new subscription requests)
Thanks for testing this out :) There does indeed seem to be something wrong
with our mailing list, so thanks for pointing that out too.

<snip testing info>

> At first I thought this might be caused by the metadata info being propagated
> to the other nodes but the caches not being flushed to disk on the node that
> wrote to a file. So I tested this by copying ~2GB sized files to try to cause
> some memory pressure, yet with the same kind of disappointing results.
It definitely looks like the messages you're getting are due to some nodes
reading old or invalid metadata. If the cache is bad on the other nodes,
re-writing blocks from the same node won't necessarily make things better.

> I'd like to know if anyone on the list has had the opportunity of testing
> OCFS2 or had similar problems. OTOH, if I'm wrongly assuming something about
> OCFS2 which I shouldn't be, please tell me and I'll apologise for wasting
> your time ;-)
No, you're not assuming any behavior that it shouldn't have, so there's
certainly something wrong going on. I have some
suggestions below. As far as testing goes, we have a pretty fair number of
folks running large production systems on this, so I'd definitely say it's
getting tested ;) That said, things can always get temporarily broken,
which is where hearing from folks like you helps a lot.

> I'm willing to make any tests or apply any patches you want. I'll be trying
> to keep the machines and the disk box for as many days as possible, so please
> try to bug me if you think these are real bugs and you want me to test fixes
> before 2.6.16 comes out.
> If you need kernel .config or any other info please ask.
Great. We'll keep things simple at first. If I could get a copy of the
/etc/ocfs2/cluster.conf files from each node that'd be great. A full log of the
OCFS2 messages you see on each node, starting from mount to unmount would also
help. That includes any dlm_* messages - in particular the ones printed when
a node mounts and unmounts. If you're using any mount options it'd be
helpful to know those too.
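
Something along these lines on each node should capture it (the grep pattern
is only a suggestion for picking out the relevant messages):

cat /etc/ocfs2/cluster.conf
grep ocfs2 /proc/mounts                                # mount options actually in effect
dmesg | grep -E 'ocfs2|o2net|o2hb|dlm' > ocfs2-$(hostname).log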
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]

2006-02-10 07:12:50

by Claudio Martins

Subject: Re: OCFS2 Filesystem inconsistency across nodes


On Friday 10 February 2006 05:49, Wim Coekaerts wrote:
> do you mix x86 and x86-64 ?

No, this is all x86. One of the nodes is a Dual AthlonMP 1800+, the other two
are Dual Intel Xeon 2.6GHz with HT enabled.

Regards

Claudio

2006-02-10 07:20:07

by Claudio Martins

Subject: Re: OCFS2 Filesystem inconsistency across nodes


On Friday 10 February 2006 06:46, Mark Fasheh wrote:
>
> Great. We'll keep things simple at first. If I could get a copy of the
> /etc/ocfs2/cluster.conf files from each node that'd be great. A full log of
> the OCFS2 messages you see on each node, starting from mount to unmount
> would also help. That includes any dlm_* messages - in particular the ones
> printed when a node mounts and unmounts. If you're using any mount options
> it'd be helpful to know those too.


Hi Mark

Thanks for the quick reply. I'll redo the steps from scratch, capture the
messages since the first mount and send that info ASAP.

Thanks for the help

Regards

Claudio

2006-02-11 05:41:06

by Claudio Martins

Subject: Re: OCFS2 Filesystem inconsistency across nodes


On Friday 10 February 2006 06:46, Mark Fasheh wrote:
> Great. We'll keep things simple at first. If I could get a copy of the
> /etc/ocfs2/cluster.conf files from each node that'd be great. A full log of
> the OCFS2 messages you see on each node, starting from mount to unmount
> would also help. That includes any dlm_* messages - in particular the ones
> printed when a node mounts and unmounts. If you're using any mount options
> it'd be helpful to know those too.

Hi again,

This is my /etc/ocfs2/cluster.conf on every node:

cluster:
        node_count = 3
        name = oratest

node:
        ip_port = 7777
        ip_address = 192.168.100.30
        number = 0
        name = iscsi-teste
        cluster = oratest

node:
        ip_port = 7777
        ip_address = 192.168.100.31
        number = 1
        name = orateste1
        cluster = oratest

node:
        ip_port = 7777
        ip_address = 192.168.100.32
        number = 2
        name = orateste2
        cluster = oratest

------------------------------------------

So today I rebooted the machines and started over and formatted the volume
again from node 0 with

iscsi-teste:~# mkfs.ocfs2 -b 4K -C 64K -N 4 -L OCFSTest1 /dev/sda
mkfs.ocfs2 1.1.5
Filesystem label=OCFSTest1
Block size=4096 (bits=12)
Cluster size=65536 (bits=16)
Volume size=2489995755520 (37994320 clusters) (607909120 blocks)
1178 cluster groups (tail covers 29008 clusters, rest cover 32256 clusters)
Journal size=33554432
Initial number of node slots: 4
WARNING: bitmap is very large, consider using a larger cluster size and/or
a smaller volume
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing lost+found: done
mkfs.ocfs2 successful


(By the way, I didn't use a bigger cluster size because we plan to use this
fs to store mainly lots of small files)

I then mounted the fs (with default options: rw,_netdev,heartbeat=local) on
node 0 and node 2 and started the usual tests by creating, writing and
copying files.
Later I also mounted the volume on node 1. By this time, node 0 and node 2
showed completely different files in the same directory. After mounting, I
could see on node 1 the files I had created from node 0, but not the ones
created from node 2. After some more file tests I unmounted on all nodes and
then remounted on all of them.

I can get an interesting result by creating several large files on *node 0*
and md5summing them. When I list the same directory on node 2 the files are
not visible there, so I also create several large files from *node 2*; when I
then md5sum the files I created on node 0 again, the md5sums have changed.
So I think that node 2 was allocating and writing to space that was already
allocated by node 0, overwriting files that were already there.
I can reproduce this easily with any pair of nodes. Kernel messages follow
for each node, though most of the time there are no special error
messages. At the end there are some on node 0, from when I was copying files
and extracting tarballs on all three nodes concurrently. I put the logs online
to avoid upsetting lkml.
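
(In shell terms the test boils down to roughly this; mount point and file
sizes are arbitrary:)

node0# cd /mnt/ocfs2 && for i in 1 2 3; do dd if=/dev/urandom of=big$i bs=1M count=512; done
node0# md5sum big* > /tmp/sums.node0
node2# ls /mnt/ocfs2                        # the big* files may not even show up here
node2# cd /mnt/ocfs2 && for i in 4 5 6; do dd if=/dev/urandom of=big$i bs=1M count=512; done
node0# cd /mnt/ocfs2 && md5sum -c /tmp/sums.node0   # checksums of big1..big3 no longer match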

dmesg from iscsi-teste (node 0):
http://coyote.ist.utl.pt/ocfs2/dmesg_iscsi-teste.txt

dmesg from orateste1 (node 1):
http://coyote.ist.utl.pt/ocfs2/dmesg_orateste1.txt

dmesg from orateste2 (node 2):
http://coyote.ist.utl.pt/ocfs2/dmesg_orateste2.txt

These are the logs from yesterday's tests, also with some interesting error
messages on node 2 (the only difference yesterday was that I had the SCTP
module loaded, but that doesn't seem to change the results):

http://coyote.ist.utl.pt/ocfs2/kern.log-iscsi-teste.txt
http://coyote.ist.utl.pt/ocfs2/kern.log-orateste1.txt
http://coyote.ist.utl.pt/ocfs2/kern.log-orateste2.txt

And the kernel config files:

http://coyote.ist.utl.pt/ocfs2/config-2.6.16-rc2-git3-XEON.txt
http://coyote.ist.utl.pt/ocfs2/config-2.6.16-rc2-git3-AMD.txt

If you need the files emailed to you just tell me.
Let me know if you need more tests/info.

Thanks for the help.

Best regards

Claudio

2006-02-13 15:09:06

by Nohez

Subject: Re: OCFS2 Filesystem inconsistency across nodes


Hello,

We are testing OCFS2 on a 2 node cluster. Both nodes are Sun v40z machines
with OpenSuSE 10.1 Beta3 (x86_64) installed (kernel-smp-2.6.16_rc2_git5-3).
The OCFS2 version in the OpenSuSE kernel is 1.1.7-SLES. OCFS2 was fencing
the nodes with the default heartbeat threshold, so the heartbeat threshold
is now configured to 31 on both nodes (O2CB_HEARTBEAT_THRESHOLD=31).
We are using a Sun StorEdge 6920 for shared storage and have also set up
multipath on both nodes. Formatted a 200GB shared volume with the following
command:

mkfs.ocfs2 -b 4K -C 32K -N 4 -L Raid5 /dev/mapper/3600015d00004b200000000000000050e-part1
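
(Referring back to the heartbeat setting above: that lives in the o2cb
sysconfig file -- presumably /etc/sysconfig/o2cb on this SuSE install,
/etc/default/o2cb on Debian -- with only the threshold changed from the
default; variable names are from memory:)

O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=31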

Performed the following 2 tests:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Test1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mounted the formatted partition on both the nodes.
sun5:~ # mount /dev/mapper/3600015d00004b200000000000000050e-part1 /raid5 -t ocfs2
sun4:~ # mount /dev/mapper/3600015d00004b200000000000000050e-part1 /raid5 -t ocfs2

Created a tar file on Sun5.
sun5:~ # tar cvf /raid5/test.tar /etc

File viewable on both the nodes.
sun5:~ # ls -l /raid5/test.tar
-rw-r--r-- 1 root root 21749760 Feb 13 19:57 /raid5/test.tar
sun4:~ # ls -l /raid5/test.tar
-rw-r--r-- 1 root root 21749760 Feb 13 19:57 /raid5/test.tar

Created a file on sun4 using touch.
sun4:~ # touch /raid5/abcd
touch: cannot touch `/raid5/abcd': Input/output error

OCFS2 marks the filesystem read-only on sun4. Relevant messages from
/var/log/messages of sun4:

<----------------------------------------------------------->
<Sun4 /var/log/messages>
<----------------------------------------------------------->
Feb 13 19:56:18 sun4 klogd: (4200,0):o2net_set_nn_state:418 connected to node sun5 (num 0) at 196.1.1.240:7777
Feb 13 19:56:22 sun4 klogd: OCFS2 1.1.7-SLES Mon Jan 16 11:58:10 PST 2006 (build sles)
Feb 13 19:56:22 sun4 klogd: (4784,3):ocfs2_initialize_super:1379 max_slots for this device: 4
Feb 13 19:56:22 sun4 klogd: (4784,3):ocfs2_fill_local_node_info:1044 I am node 2
Feb 13 19:56:22 sun4 klogd: (4784,3):__dlm_print_nodes:384 Nodes in my domain ("2FC845E5792045B198E42561E9A0C405"):
Feb 13 19:56:22 sun4 klogd: (4784,3):__dlm_print_nodes:388 node 0
Feb 13 19:56:22 sun4 klogd: (4784,3):__dlm_print_nodes:388 node 2
Feb 13 19:56:22 sun4 klogd: (4784,3):ocfs2_find_slot:267 taking node slot 1
Feb 13 19:56:22 sun4 klogd: kjournald starting. Commit interval 5 seconds
Feb 13 19:56:22 sun4 klogd: ocfs2: Mounting device (253,4) on (node 2, slot 1)
Feb 13 19:58:30 sun4 klogd: OCFS2: ERROR (device dm-4): ocfs2_search_chain: Group Descriptor # 4485090715960753726 has bad signature >>>>>>>
Feb 13 19:58:30 sun4 klogd: File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted.
Feb 13 19:58:30 sun4 klogd: (4890,0):ocfs2_claim_suballoc_bits:1157 ERROR: status = -5
Feb 13 19:58:30 sun4 klogd: (4890,0):ocfs2_claim_new_inode:1255 ERROR: status = -5
Feb 13 19:58:30 sun4 klogd: (4890,0):ocfs2_mknod_locked:479 ERROR: status = -5
Feb 13 19:58:30 sun4 klogd: (4890,0):ocfs2_mknod:384 ERROR: status = -5
<----------------------------------------------------------->

<----------------------------------------------------------->
<Sun5 /var/log/messages>
<----------------------------------------------------------->
Feb 13 19:56:03 sun5 klogd: OCFS2 1.1.7-SLES Mon Jan 16 11:58:10 PST 2006 (build sles)
Feb 13 19:56:03 sun5 klogd: (4914,2):ocfs2_initialize_super:1379 max_slots for this device: 4
Feb 13 19:56:03 sun5 klogd: (4914,2):ocfs2_fill_local_node_info:1044 I am node 0
Feb 13 19:56:03 sun5 klogd: (4914,2):__dlm_print_nodes:384 Nodes in my domain ("2FC845E5792045B198E42561E9A0C405"):
Feb 13 19:56:03 sun5 klogd: (4914,2):__dlm_print_nodes:388 node 0
Feb 13 19:56:03 sun5 klogd: (4914,2):ocfs2_find_slot:267 taking node slot 0
Feb 13 19:56:03 sun5 klogd: kjournald starting. Commit interval 5 seconds
Feb 13 19:56:03 sun5 klogd: ocfs2: Mounting device (253,5) on (node 0, slot 0)
Feb 13 19:56:18 sun5 klogd: (4198,0):o2net_set_nn_state:418 accepted connection from node sun4 (num 2) at 196.1.1.229:7777
Feb 13 19:56:22 sun5 klogd: (4198,0):__dlm_print_nodes:384 Nodes in my domain ("2FC845E5792045B198E42561E9A0C405"):
Feb 13 19:56:22 sun5 klogd: (4198,0):__dlm_print_nodes:388 node 0
Feb 13 19:56:22 sun5 klogd: (4198,0):__dlm_print_nodes:388 node 2
<----------------------------------------------------------->
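
(The read-only message above asks for fsck.ocfs2 once the file system is
unmounted; presumably something along these lines, with the same device path:)

sun4:~ # umount /raid5
sun4:~ # fsck.ocfs2 -f /dev/mapper/3600015d00004b200000000000000050e-part1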




~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Test 2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mounted the formatted partition on both the nodes.
This time mounted it first on Sun4 and then on Sun5
sun4:~ # mount /dev/mapper/3600015d00004b200000000000000050e-part1 /raid5 -t ocfs2
sun5:~ # mount /dev/mapper/3600015d00004b200000000000000050e-part1 /raid5 -t ocfs2

Created 2 files:
sun4:~ # touch /raid5/abcd
sun4:~ # touch /raid5/abcd1

Files viewable on both the nodes.
sun4:~ # ls -l /raid5/abcd*
-rw-r--r-- 1 root root 0 Feb 13 20:20 /raid5/abcd
-rw-r--r-- 1 root root 0 Feb 13 20:20 /raid5/abcd1

sun5:~ # ls -l /raid5/abcd*
-rw-r--r-- 1 root root 0 Feb 13 20:20 /raid5/abcd
-rw-r--r-- 1 root root 0 Feb 13 20:20 /raid5/abcd1

Issued "rm" command on Sun5.
sun5:~ # rm /raid5/abcd
sun5:~ # rm /raid5/abcd1

sun5:~ # ls -l /raid5/abcd*
/bin/ls: /raid5/abcd*: No such file or directory

This time there is no error message on Sun5. However, the files still exist on Sun4:

sun4:~ # ls -l /raid5/abcd*
-rw-r--r-- 1 root root 0 Feb 13 20:20 /raid5/abcd
-rw-r--r-- 1 root root 0 Feb 13 20:20 /raid5/abcd1

Sun4 kernel reports the following:

<----------------------------------------------------------->
<Sun4 /var/log/messages>
<----------------------------------------------------------->
Feb 13 20:20:18 sun4 klogd: OCFS2 1.1.7-SLES Mon Jan 16 11:58:10 PST 2006 (build sles)
Feb 13 20:20:18 sun4 klogd: (4774,0):ocfs2_initialize_super:1379 max_slots for this device: 4
Feb 13 20:20:18 sun4 klogd: (4774,0):ocfs2_fill_local_node_info:1044 I am node 2
Feb 13 20:20:18 sun4 klogd: (4774,0):__dlm_print_nodes:384 Nodes in my domain ("2FC845E5792045B198E42561E9A0C405"):
Feb 13 20:20:18 sun4 klogd: (4774,0):__dlm_print_nodes:388 node 2
Feb 13 20:20:18 sun4 klogd: (4774,0):ocfs2_find_slot:267 taking node slot 0
Feb 13 20:20:18 sun4 klogd: kjournald starting. Commit interval 5 seconds
Feb 13 20:20:18 sun4 klogd: ocfs2: Mounting device (253,3) on (node 2, slot 0)
Feb 13 20:20:28 sun4 klogd: (4235,0):o2net_set_nn_state:418 connected to node sun5 (num 0) at 196.1.1.240:7777
Feb 13 20:20:32 sun4 klogd: (4235,0):__dlm_print_nodes:384 Nodes in my domain ("2FC845E5792045B198E42561E9A0C405"):
Feb 13 20:20:32 sun4 klogd: (4235,0):__dlm_print_nodes:388 node 0
Feb 13 20:20:32 sun4 klogd: (4235,0):__dlm_print_nodes:388 node 2
<----------------------------------------------------------->

<----------------------------------------------------------->
<Sun5 /var/log/messages>
<----------------------------------------------------------->
Feb 13 20:20:28 sun5 klogd: (4310,0):o2net_set_nn_state:418 accepted connection from node sun4 (num 2) at 196.1.1.229:7777
Feb 13 20:20:32 sun5 klogd: OCFS2 1.1.7-SLES Mon Jan 16 11:58:10 PST 2006 (build sles)
Feb 13 20:20:32 sun5 klogd: (5284,0):ocfs2_initialize_super:1379 max_slots for this device: 4
Feb 13 20:20:32 sun5 klogd: (5284,2):ocfs2_fill_local_node_info:1044 I am node 0
Feb 13 20:20:32 sun5 klogd: (5284,2):__dlm_print_nodes:384 Nodes in my domain ("2FC845E5792045B198E42561E9A0C405"):
Feb 13 20:20:32 sun5 klogd: (5284,2):__dlm_print_nodes:388 node 0
Feb 13 20:20:32 sun5 klogd: (5284,2):__dlm_print_nodes:388 node 2
Feb 13 20:20:32 sun5 klogd: (5284,2):ocfs2_find_slot:267 taking node slot 1
Feb 13 20:20:32 sun5 klogd: kjournald starting. Commit interval 5 seconds
Feb 13 20:20:32 sun5 klogd: ocfs2: Mounting device (253,5) on (node 0, slot 1)
<----------------------------------------------------------->

Filesystem is in read-write mode on both the nodes.

Unmounted filesystem on Sun4 and remounted.
sun4:~ # umount /raid5
sun4:~ # mount /dev/mapper/3600015d00004b200000000000000050e-part1 /raid5 -t ocfs2
sun4:~ # ls -l /raid5/abcd*
/bin/ls: /raid5/abcd*: No such file or directory

Now the filesystem is in sync on both nodes.

~~~~~~~~~~~~~~~~~~~~~~~
/etc/cluster/ocfs2.conf (the conf file is the same on both servers;
sun1, mentioned in the conf file, is not powered on)
~~~~~~~~~~~~~~~~~~~~~~~
node:
        ip_port = 7777
        ip_address = 196.1.1.240
        number = 0
        name = sun5
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 196.1.1.118
        number = 1
        name = sun1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 196.1.1.229
        number = 2
        name = sun4
        cluster = ocfs2

cluster:
        node_count = 3
        name = ocfs2
~~~~~~~~~~~~~~~~~~~~~~~

Regards

Nohez

2006-02-13 22:26:32

by Mark Fasheh

Subject: Re: OCFS2 Filesystem inconsistency across nodes

On Sat, Feb 11, 2006 at 05:40:57AM +0000, Claudio Martins wrote:
> This is my /etc/ocfs2/cluster.conf on every node:

Hi Claudio,
Thanks for sending me your config files. Everything seems in order.
I was easily able to reproduce your problem on my cluster and was able to
git-bisect my way to some JBD changes which seem to be causing the issue.
Reverting those patches fixes things. Can you apply the attached patch and
confirm that it also fixes this particular problem for you? You'll have to
apply to all kernels in your cluster and either run fsck.ocfs2 or create a
new file system before testing again.
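
Roughly, on each node -- the patch file name below is just whatever you save
the attachment as:

cd linux-2.6.16-rc2-git3
patch -p1 < ../jbd-revert-checkpoint-lists.patch
# rebuild and boot the patched kernel on every node, then, with the volume
# unmounted everywhere, either:
fsck.ocfs2 -f /dev/sda
# ...or simply re-run mkfs.ocfs2 as before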

Linus, Andrew, Jan,
OCFS2 uses journal_flush() to sync metadata out to disk when another
node wants to obtain a lock on an inode which has pending journaled changes.
Something in Jan Kara's patch titled "jbd: split checkpoint lists" broke
this for OCFS2 (and I suspect for other users of JBD as well). As a result
metadata is not always completely flushed to disk by the end of the
journal_flush() call.

One easy way to reproduce is to create files from one node and list the
directory from another. Switching the listing and creating nodes around
makes things reproduce more quickly -- eventually the listing node will
start missing new files.
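
Something as simple as this shows it (the mount point being whatever you use):

nodeA$ for i in $(seq 1 200); do touch /mnt/ocfs2/file$i; sleep 1; done
nodeB$ while true; do ls /mnt/ocfs2 | wc -l; sleep 2; done   # the count eventually stops keeping up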

Unfortunately I haven't been able to figure out exactly where the bug is,
leading me to develop a patch to revert the change for now. I suppose folks
can use this as a workaround until the bug is fixed.

Jan, I'd appreciate as much input as you can give -- in the meantime OCFS2
is totally broken for the multi-node case :(
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]

[PATCH] jbd: revert checkpoint list changes

This patch reverts commit f93ea411b73594f7d144855fd34278bcf34a9afc:
[PATCH] jbd: split checkpoint lists

This broke journal_flush() for OCFS2, which is its method of
being sure that metadata is sent to disk for another node.

And two related commits 8d3c7fce2d20ecc3264c8d8c91ae3beacdeaed1b and
43c3e6f5abdf6acac9b90c86bf03f995bf7d3d92 with the subjects:
[PATCH] jbd: log_do_checkpoint fix
[PATCH] jbd: remove_transaction fix

These seem to be incremental bugfixes on the original patch and as such are
no longer needed.

Signed-off-by: Mark Fasheh <[email protected]>

diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
index e6265a0..014a51f 100644
--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -24,75 +24,29 @@
#include <linux/slab.h>

/*
- * Unlink a buffer from a transaction checkpoint list.
+ * Unlink a buffer from a transaction.
*
* Called with j_list_lock held.
*/

-static void __buffer_unlink_first(struct journal_head *jh)
+static inline void __buffer_unlink(struct journal_head *jh)
{
transaction_t *transaction;

transaction = jh->b_cp_transaction;
+ jh->b_cp_transaction = NULL;

jh->b_cpnext->b_cpprev = jh->b_cpprev;
jh->b_cpprev->b_cpnext = jh->b_cpnext;
- if (transaction->t_checkpoint_list == jh) {
+ if (transaction->t_checkpoint_list == jh)
transaction->t_checkpoint_list = jh->b_cpnext;
- if (transaction->t_checkpoint_list == jh)
- transaction->t_checkpoint_list = NULL;
- }
-}
-
-/*
- * Unlink a buffer from a transaction checkpoint(io) list.
- *
- * Called with j_list_lock held.
- */
-
-static inline void __buffer_unlink(struct journal_head *jh)
-{
- transaction_t *transaction;
-
- transaction = jh->b_cp_transaction;
-
- __buffer_unlink_first(jh);
- if (transaction->t_checkpoint_io_list == jh) {
- transaction->t_checkpoint_io_list = jh->b_cpnext;
- if (transaction->t_checkpoint_io_list == jh)
- transaction->t_checkpoint_io_list = NULL;
- }
-}
-
-/*
- * Move a buffer from the checkpoint list to the checkpoint io list
- *
- * Called with j_list_lock held
- */
-
-static inline void __buffer_relink_io(struct journal_head *jh)
-{
- transaction_t *transaction;
-
- transaction = jh->b_cp_transaction;
- __buffer_unlink_first(jh);
-
- if (!transaction->t_checkpoint_io_list) {
- jh->b_cpnext = jh->b_cpprev = jh;
- } else {
- jh->b_cpnext = transaction->t_checkpoint_io_list;
- jh->b_cpprev = transaction->t_checkpoint_io_list->b_cpprev;
- jh->b_cpprev->b_cpnext = jh;
- jh->b_cpnext->b_cpprev = jh;
- }
- transaction->t_checkpoint_io_list = jh;
+ if (transaction->t_checkpoint_list == jh)
+ transaction->t_checkpoint_list = NULL;
}

/*
* Try to release a checkpointed buffer from its transaction.
- * Returns 1 if we released it and 2 if we also released the
- * whole transaction.
- *
+ * Returns 1 if we released it.
* Requires j_list_lock
* Called under jbd_lock_bh_state(jh2bh(jh)), and drops it
*/
@@ -103,11 +57,12 @@ static int __try_to_free_cp_buf(struct j

if (jh->b_jlist == BJ_None && !buffer_locked(bh) && !buffer_dirty(bh)) {
JBUFFER_TRACE(jh, "remove from checkpoint list");
- ret = __journal_remove_checkpoint(jh) + 1;
+ __journal_remove_checkpoint(jh);
jbd_unlock_bh_state(bh);
journal_remove_journal_head(bh);
BUFFER_TRACE(bh, "release");
__brelse(bh);
+ ret = 1;
} else {
jbd_unlock_bh_state(bh);
}
@@ -162,53 +117,83 @@ static void jbd_sync_bh(journal_t *journ
}

/*
- * Clean up transaction's list of buffers submitted for io.
- * We wait for any pending IO to complete and remove any clean
- * buffers. Note that we take the buffers in the opposite ordering
- * from the one in which they were submitted for IO.
+ * Clean up a transaction's checkpoint list.
+ *
+ * We wait for any pending IO to complete and make sure any clean
+ * buffers are removed from the transaction.
+ *
+ * Return 1 if we performed any actions which might have destroyed the
+ * checkpoint. (journal_remove_checkpoint() deletes the transaction when
+ * the last checkpoint buffer is cleansed)
*
* Called with j_list_lock held.
*/
-
-static void __wait_cp_io(journal_t *journal, transaction_t *transaction)
+static int __cleanup_transaction(journal_t *journal, transaction_t *transaction)
{
- struct journal_head *jh;
+ struct journal_head *jh, *next_jh, *last_jh;
struct buffer_head *bh;
- tid_t this_tid;
- int released = 0;
+ int ret = 0;
+
+ assert_spin_locked(&journal->j_list_lock);
+ jh = transaction->t_checkpoint_list;
+ if (!jh)
+ return 0;

- this_tid = transaction->t_tid;
-restart:
- /* Didn't somebody clean up the transaction in the meanwhile */
- if (journal->j_checkpoint_transactions != transaction ||
- transaction->t_tid != this_tid)
- return;
- while (!released && transaction->t_checkpoint_io_list) {
- jh = transaction->t_checkpoint_io_list;
+ last_jh = jh->b_cpprev;
+ next_jh = jh;
+ do {
+ jh = next_jh;
bh = jh2bh(jh);
- if (!jbd_trylock_bh_state(bh)) {
- jbd_sync_bh(journal, bh);
- spin_lock(&journal->j_list_lock);
- goto restart;
- }
if (buffer_locked(bh)) {
atomic_inc(&bh->b_count);
spin_unlock(&journal->j_list_lock);
- jbd_unlock_bh_state(bh);
wait_on_buffer(bh);
/* the journal_head may have gone by now */
BUFFER_TRACE(bh, "brelse");
__brelse(bh);
- spin_lock(&journal->j_list_lock);
- goto restart;
+ goto out_return_1;
}
+
/*
- * Now in whatever state the buffer currently is, we know that
- * it has been written out and so we can drop it from the list
+ * This is foul
*/
- released = __journal_remove_checkpoint(jh);
- jbd_unlock_bh_state(bh);
- }
+ if (!jbd_trylock_bh_state(bh)) {
+ jbd_sync_bh(journal, bh);
+ goto out_return_1;
+ }
+
+ if (jh->b_transaction != NULL) {
+ transaction_t *t = jh->b_transaction;
+ tid_t tid = t->t_tid;
+
+ spin_unlock(&journal->j_list_lock);
+ jbd_unlock_bh_state(bh);
+ log_start_commit(journal, tid);
+ log_wait_commit(journal, tid);
+ goto out_return_1;
+ }
+
+ /*
+ * AKPM: I think the buffer_jbddirty test is redundant - it
+ * shouldn't have NULL b_transaction?
+ */
+ next_jh = jh->b_cpnext;
+ if (!buffer_dirty(bh) && !buffer_jbddirty(bh)) {
+ BUFFER_TRACE(bh, "remove from checkpoint");
+ __journal_remove_checkpoint(jh);
+ jbd_unlock_bh_state(bh);
+ journal_remove_journal_head(bh);
+ __brelse(bh);
+ ret = 1;
+ } else {
+ jbd_unlock_bh_state(bh);
+ }
+ } while (jh != last_jh);
+
+ return ret;
+out_return_1:
+ spin_lock(&journal->j_list_lock);
+ return 1;
}

#define NR_BATCH 64
@@ -218,7 +203,9 @@ __flush_batch(journal_t *journal, struct
{
int i;

+ spin_unlock(&journal->j_list_lock);
ll_rw_block(SWRITE, *batch_count, bhs);
+ spin_lock(&journal->j_list_lock);
for (i = 0; i < *batch_count; i++) {
struct buffer_head *bh = bhs[i];
clear_buffer_jwrite(bh);
@@ -234,46 +221,19 @@ __flush_batch(journal_t *journal, struct
* Return 1 if something happened which requires us to abort the current
* scan of the checkpoint list.
*
- * Called with j_list_lock held and drops it if 1 is returned
+ * Called with j_list_lock held.
* Called under jbd_lock_bh_state(jh2bh(jh)), and drops it
*/
-static int __process_buffer(journal_t *journal, struct journal_head *jh,
- struct buffer_head **bhs, int *batch_count)
+static int __flush_buffer(journal_t *journal, struct journal_head *jh,
+ struct buffer_head **bhs, int *batch_count,
+ int *drop_count)
{
struct buffer_head *bh = jh2bh(jh);
int ret = 0;

- if (buffer_locked(bh)) {
- get_bh(bh);
- spin_unlock(&journal->j_list_lock);
- jbd_unlock_bh_state(bh);
- wait_on_buffer(bh);
- /* the journal_head may have gone by now */
- BUFFER_TRACE(bh, "brelse");
- put_bh(bh);
- ret = 1;
- }
- else if (jh->b_transaction != NULL) {
- transaction_t *t = jh->b_transaction;
- tid_t tid = t->t_tid;
+ if (buffer_dirty(bh) && !buffer_locked(bh) && jh->b_jlist == BJ_None) {
+ J_ASSERT_JH(jh, jh->b_transaction == NULL);

- spin_unlock(&journal->j_list_lock);
- jbd_unlock_bh_state(bh);
- log_start_commit(journal, tid);
- log_wait_commit(journal, tid);
- ret = 1;
- }
- else if (!buffer_dirty(bh)) {
- J_ASSERT_JH(jh, !buffer_jbddirty(bh));
- BUFFER_TRACE(bh, "remove from checkpoint");
- __journal_remove_checkpoint(jh);
- spin_unlock(&journal->j_list_lock);
- jbd_unlock_bh_state(bh);
- journal_remove_journal_head(bh);
- put_bh(bh);
- ret = 1;
- }
- else {
/*
* Important: we are about to write the buffer, and
* possibly block, while still holding the journal lock.
@@ -286,30 +246,45 @@ static int __process_buffer(journal_t *j
J_ASSERT_BH(bh, !buffer_jwrite(bh));
set_buffer_jwrite(bh);
bhs[*batch_count] = bh;
- __buffer_relink_io(jh);
jbd_unlock_bh_state(bh);
(*batch_count)++;
if (*batch_count == NR_BATCH) {
- spin_unlock(&journal->j_list_lock);
__flush_batch(journal, bhs, batch_count);
ret = 1;
}
+ } else {
+ int last_buffer = 0;
+ if (jh->b_cpnext == jh) {
+ /* We may be about to drop the transaction. Tell the
+ * caller that the lists have changed.
+ */
+ last_buffer = 1;
+ }
+ if (__try_to_free_cp_buf(jh)) {
+ (*drop_count)++;
+ ret = last_buffer;
+ }
}
return ret;
}

/*
- * Perform an actual checkpoint. We take the first transaction on the
- * list of transactions to be checkpointed and send all its buffers
- * to disk. We submit larger chunks of data at once.
+ * Perform an actual checkpoint. We don't write out only enough to
+ * satisfy the current blocked requests: rather we submit a reasonably
+ * sized chunk of the outstanding data to disk at once for
+ * efficiency. __log_wait_for_space() will retry if we didn't free enough.
*
+ * However, we _do_ take into account the amount requested so that once
+ * the IO has been queued, we can return as soon as enough of it has
+ * completed to disk.
+ *
* The journal should be locked before calling this function.
*/
int log_do_checkpoint(journal_t *journal)
{
- transaction_t *transaction;
- tid_t this_tid;
int result;
+ int batch_count = 0;
+ struct buffer_head *bhs[NR_BATCH];

jbd_debug(1, "Start checkpoint\n");

@@ -324,70 +299,79 @@ int log_do_checkpoint(journal_t *journal
return result;

/*
- * OK, we need to start writing disk blocks. Take one transaction
- * and write it.
+ * OK, we need to start writing disk blocks. Try to free up a
+ * quarter of the log in a single checkpoint if we can.
*/
- spin_lock(&journal->j_list_lock);
- if (!journal->j_checkpoint_transactions)
- goto out;
- transaction = journal->j_checkpoint_transactions;
- this_tid = transaction->t_tid;
-restart:
/*
- * If someone cleaned up this transaction while we slept, we're
- * done (maybe it's a new transaction, but it fell at the same
- * address).
+ * AKPM: check this code. I had a feeling a while back that it
+ * degenerates into a busy loop at unmount time.
*/
- if (journal->j_checkpoint_transactions == transaction &&
- transaction->t_tid == this_tid) {
- int batch_count = 0;
- struct buffer_head *bhs[NR_BATCH];
- struct journal_head *jh;
- int retry = 0;
-
- while (!retry && transaction->t_checkpoint_list) {
+ spin_lock(&journal->j_list_lock);
+ while (journal->j_checkpoint_transactions) {
+ transaction_t *transaction;
+ struct journal_head *jh, *last_jh, *next_jh;
+ int drop_count = 0;
+ int cleanup_ret, retry = 0;
+ tid_t this_tid;
+
+ transaction = journal->j_checkpoint_transactions;
+ this_tid = transaction->t_tid;
+ jh = transaction->t_checkpoint_list;
+ last_jh = jh->b_cpprev;
+ next_jh = jh;
+ do {
struct buffer_head *bh;

- jh = transaction->t_checkpoint_list;
+ jh = next_jh;
+ next_jh = jh->b_cpnext;
bh = jh2bh(jh);
if (!jbd_trylock_bh_state(bh)) {
jbd_sync_bh(journal, bh);
+ spin_lock(&journal->j_list_lock);
retry = 1;
break;
}
- retry = __process_buffer(journal, jh, bhs,
- &batch_count);
- if (!retry &&
- lock_need_resched(&journal->j_list_lock)) {
- spin_unlock(&journal->j_list_lock);
+ retry = __flush_buffer(journal, jh, bhs, &batch_count, &drop_count);
+ if (cond_resched_lock(&journal->j_list_lock)) {
retry = 1;
break;
}
- }
+ } while (jh != last_jh && !retry);

if (batch_count) {
- if (!retry) {
- spin_unlock(&journal->j_list_lock);
- retry = 1;
- }
__flush_batch(journal, bhs, &batch_count);
+ retry = 1;
}

- if (retry) {
- spin_lock(&journal->j_list_lock);
- goto restart;
- }
/*
- * Now we have cleaned up the first transaction's checkpoint
- * list. Let's clean up the second one.
+ * If someone cleaned up this transaction while we slept, we're
+ * done
+ */
+ if (journal->j_checkpoint_transactions != transaction)
+ break;
+ if (retry)
+ continue;
+ /*
+ * Maybe it's a new transaction, but it fell at the same
+ * address
*/
- __wait_cp_io(journal, transaction);
+ if (transaction->t_tid != this_tid)
+ continue;
+ /*
+ * We have walked the whole transaction list without
+ * finding anything to write to disk. We had better be
+ * able to make some progress or we are in trouble.
+ */
+ cleanup_ret = __cleanup_transaction(journal, transaction);
+ J_ASSERT(drop_count != 0 || cleanup_ret != 0);
+ if (journal->j_checkpoint_transactions != transaction)
+ break;
}
-out:
spin_unlock(&journal->j_list_lock);
result = cleanup_journal_tail(journal);
if (result < 0)
return result;
+
return 0;
}

@@ -472,91 +456,52 @@ int cleanup_journal_tail(journal_t *jour
/* Checkpoint list management */

/*
- * journal_clean_one_cp_list
- *
- * Find all the written-back checkpoint buffers in the given list and release them.
- *
- * Called with the journal locked.
- * Called with j_list_lock held.
- * Returns number of bufers reaped (for debug)
- */
-
-static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
-{
- struct journal_head *last_jh;
- struct journal_head *next_jh = jh;
- int ret, freed = 0;
-
- *released = 0;
- if (!jh)
- return 0;
-
- last_jh = jh->b_cpprev;
- do {
- jh = next_jh;
- next_jh = jh->b_cpnext;
- /* Use trylock because of the ranking */
- if (jbd_trylock_bh_state(jh2bh(jh))) {
- ret = __try_to_free_cp_buf(jh);
- if (ret) {
- freed++;
- if (ret == 2) {
- *released = 1;
- return freed;
- }
- }
- }
- /*
- * This function only frees up some memory if possible so we
- * dont have an obligation to finish processing. Bail out if
- * preemption requested:
- */
- if (need_resched())
- return freed;
- } while (jh != last_jh);
-
- return freed;
-}
-
-/*
* journal_clean_checkpoint_list
*
* Find all the written-back checkpoint buffers in the journal and release them.
*
* Called with the journal locked.
* Called with j_list_lock held.
- * Returns number of buffers reaped (for debug)
+ * Returns number of bufers reaped (for debug)
*/

int __journal_clean_checkpoint_list(journal_t *journal)
{
transaction_t *transaction, *last_transaction, *next_transaction;
- int ret = 0, released;
+ int ret = 0;

transaction = journal->j_checkpoint_transactions;
- if (!transaction)
+ if (transaction == 0)
goto out;

last_transaction = transaction->t_cpprev;
next_transaction = transaction;
do {
+ struct journal_head *jh;
+
transaction = next_transaction;
next_transaction = transaction->t_cpnext;
- ret += journal_clean_one_cp_list(transaction->
- t_checkpoint_list, &released);
- if (need_resched())
- goto out;
- if (released)
- continue;
- /*
- * It is essential that we are as careful as in the case of
- * t_checkpoint_list with removing the buffer from the list as
- * we can possibly see not yet submitted buffers on io_list
- */
- ret += journal_clean_one_cp_list(transaction->
- t_checkpoint_io_list, &released);
- if (need_resched())
- goto out;
+ jh = transaction->t_checkpoint_list;
+ if (jh) {
+ struct journal_head *last_jh = jh->b_cpprev;
+ struct journal_head *next_jh = jh;
+
+ do {
+ jh = next_jh;
+ next_jh = jh->b_cpnext;
+ /* Use trylock because of the ranknig */
+ if (jbd_trylock_bh_state(jh2bh(jh)))
+ ret += __try_to_free_cp_buf(jh);
+ /*
+ * This function only frees up some memory
+ * if possible so we dont have an obligation
+ * to finish processing. Bail out if preemption
+ * requested:
+ */
+ if (need_resched())
+ goto out;
+ } while (jh != last_jh);
+ }
} while (transaction != last_transaction);
out:
return ret;
@@ -571,22 +516,18 @@ out:
* buffer updates committed in that transaction have safely been stored
* elsewhere on disk. To achieve this, all of the buffers in a
* transaction need to be maintained on the transaction's checkpoint
- * lists until they have been rewritten, at which point this function is
+ * list until they have been rewritten, at which point this function is
* called to remove the buffer from the existing transaction's
- * checkpoint lists.
- *
- * The function returns 1 if it frees the transaction, 0 otherwise.
+ * checkpoint list.
*
* This function is called with the journal locked.
* This function is called with j_list_lock held.
- * This function is called with jbd_lock_bh_state(jh2bh(jh))
*/

-int __journal_remove_checkpoint(struct journal_head *jh)
+void __journal_remove_checkpoint(struct journal_head *jh)
{
transaction_t *transaction;
journal_t *journal;
- int ret = 0;

JBUFFER_TRACE(jh, "entry");

@@ -597,10 +538,8 @@ int __journal_remove_checkpoint(struct j
journal = transaction->t_journal;

__buffer_unlink(jh);
- jh->b_cp_transaction = NULL;

- if (transaction->t_checkpoint_list != NULL ||
- transaction->t_checkpoint_io_list != NULL)
+ if (transaction->t_checkpoint_list != NULL)
goto out;
JBUFFER_TRACE(jh, "transaction has no more buffers");

@@ -626,10 +565,8 @@ int __journal_remove_checkpoint(struct j
/* Just in case anybody was waiting for more transactions to be
checkpointed... */
wake_up(&journal->j_wait_logspace);
- ret = 1;
out:
JBUFFER_TRACE(jh, "exit");
- return ret;
}

/*
@@ -691,7 +628,6 @@ void __journal_drop_transaction(journal_
J_ASSERT(transaction->t_shadow_list == NULL);
J_ASSERT(transaction->t_log_list == NULL);
J_ASSERT(transaction->t_checkpoint_list == NULL);
- J_ASSERT(transaction->t_checkpoint_io_list == NULL);
J_ASSERT(transaction->t_updates == 0);
J_ASSERT(journal->j_committing_transaction != transaction);
J_ASSERT(journal->j_running_transaction != transaction);
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 29e62d9..002ad2b 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -829,8 +829,7 @@ restart_loop:
journal->j_committing_transaction = NULL;
spin_unlock(&journal->j_state_lock);

- if (commit_transaction->t_checkpoint_list == NULL &&
- commit_transaction->t_checkpoint_io_list == NULL) {
+ if (commit_transaction->t_checkpoint_list == NULL) {
__journal_drop_transaction(journal, commit_transaction);
} else {
if (journal->j_checkpoint_transactions == NULL) {
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index 0fe4aa8..41ee799 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -498,12 +498,6 @@ struct transaction_s
struct journal_head *t_checkpoint_list;

/*
- * Doubly-linked circular list of all buffers submitted for IO while
- * checkpointing. [j_list_lock]
- */
- struct journal_head *t_checkpoint_io_list;
-
- /*
* Doubly-linked circular list of temporary buffers currently undergoing
* IO in the log [j_list_lock]
*/
@@ -852,7 +846,7 @@ extern void journal_commit_transaction(j

/* Checkpoint list management */
int __journal_clean_checkpoint_list(journal_t *journal);
-int __journal_remove_checkpoint(struct journal_head *);
+void __journal_remove_checkpoint(struct journal_head *);
void __journal_insert_checkpoint(struct journal_head *, transaction_t *);

/* Buffer IO */

2006-02-14 00:00:58

by Claudio Martins

Subject: Re: OCFS2 Filesystem inconsistency across nodes


On Monday 13 February 2006 22:26, Mark Fasheh wrote:
> On Sat, Feb 11, 2006 at 05:40:57AM +0000, Claudio Martins wrote:
> > This is my /etc/ocfs2/cluster.conf on every node:
>
> Hi Claudio,
> Thanks for sending me your config files. Everything seems in order.
> I was easily able to reproduce your problem on my cluster and was able to
> git-bisect my way to some JBD changes which seem to be causing the issue.
> Reverting those patches fixes things. Can you apply the attached patch and
> confirm that it also fixes this particular problem for you? You'll have to
> apply to all kernels in your cluster and either run fsck.ocfs2 or create a
> new file system before testing again.

Hi Mark,

I'll apply the patch and rebuild my kernels and filesystem. Will be reporting
the results ASAP.

Thanks

Claudio

>
> Linus, Andrew, Jan,
> OCFS2 uses journal_flush() to sync metadata out to disk when another
> node wants to obtain a lock on an inode which has pending journaled
> changes. Something in Jan Kara's patch titled "jbd: split checkpoint lists"
> broke this for OCFS2 (and I suspect for other users of JBD as well). As a
> result metadata is not always completely flushed to disk by the end of the
> journal_flush() call.

2006-02-14 06:16:25

by Claudio Martins

Subject: Re: OCFS2 Filesystem inconsistency across nodes


On Monday 13 February 2006 22:26, Mark Fasheh wrote:
> I was easily able to reproduce your problem on my cluster and was able to
> git-bisect my way to some JBD changes which seem to be causing the issue.
> Reverting those patches fixes things. Can you apply the attached patch and
> confirm that it also fixes this particular problem for you? You'll have to
> apply to all kernels in your cluster and either run fsck.ocfs2 or create a
> new file system before testing again.
>

Hi Mark,

This patch does indeed seem to fix this particular problem. Now creating and
deleting files/directories gives expected results across nodes.

The bad news is that it didn't last long. While doing some more tests I found
another problem, but judging from kernel messages I think this one is related
to the DLM code.

The test was simple:

On node 1 untar a kernel tree and wait for tar to finish.

After tar finished I ran tar on nodes 0 and 2, each one *concurrently*
creating a separate archive from *the same* kernel tree untarred on node 1.
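
(In command terms it was roughly this -- the mount point and archive paths are
just examples:)

node1# cd /mnt/ocfs2/dir2 && tar xf /tmp/linux-2.6.16-rc2-git3.tar
# after that finished, concurrently:
node0# tar cf /tmp/tree-n0.tar -C /mnt/ocfs2/dir2 build-AMD-linux-2.6.16-rc2-git3-jbdfix1
node2# tar cf /tmp/tree-n2.tar -C /mnt/ocfs2/dir2 build-AMD-linux-2.6.16-rc2-git3-jbdfix1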

Again, since the files are big, I've put them online:

Node0:
http://coyote.ist.utl.pt/ocfs2/Feb14/kern-iscsi-teste.log

Node1:
http://coyote.ist.utl.pt/ocfs2/Feb14/kern-orateste1.log
(this node's clock was 10 minutes off, sorry about that)

Node2:
http://coyote.ist.utl.pt/ocfs2/Feb14/kern-orateste2.log


On node 0, tar exited with:
tar:
build-AMD-linux-2.6.16-rc2-git3-jbdfix1/drivers/media/video/cx25840/cx25840-core.c:
Cannot stat: Invalid argument

On node 2, tar exited with a segmentation fault.

Anyway, after that I am still able to read and write files on all three nodes
consistently.


Any ideas?

Thanks
Best regards

Claudio


2006-02-14 12:17:04

by Jan Kara

Subject: Re: OCFS2 Filesystem inconsistency across nodes

> On Sat, Feb 11, 2006 at 05:40:57AM +0000, Claudio Martins wrote:
> > This is my /etc/ocfs2/cluster.conf on every node:
>
> Hi Claudio,
> Thanks for sending me your config files. Everything seems in order.
> I was easily able to reproduce your problem on my cluster and was able to
> git-bisect my way to some JBD changes which seem to be causing the issue.
> Reverting those patches fixes things. Can you apply the attached patch and
> confirm that it also fixes this particular problem for you? You'll have to
> apply to all kernels in your cluster and either run fsck.ocfs2 or create a
> new file system before testing again.
>
> Linus, Andrew, Jan,
> OCFS2 uses journal_flush() to sync metadata out to disk when another
> node wants to obtain a lock on an inode which has pending journaled changes.
> Something in Jan Kara's patch titled "jbd: split checkpoint lists" broke
> this for OCFS2 (and I suspect for other users of JBD as well). As a result
> metadata is not always completely flushed to disk by the end of the
> journal_flush() call.
>
> One easy way to reproduce is to create files from one node and list the
> directory from another. Switching the listing and creating nodes around
> makes things reproduce more quickly -- eventually the listing node will
> start missing new files.
Ok, I'll have a look at the problem. Probably something in
log_do_checkpoint() is not waiting for all the data or something like
that. I'll try to reproduce with ext3 - it uses journal_flush() in
ext3_bmap() so if journal_flush() is not flushing all the data we should
be able to see that... Thanks for spotting the problem.
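
Something like this is what I have in mind (untested, and I'm assuming the
data=journal case is the one where ext3_bmap() actually ends up calling
journal_flush()):

mount -t ext3 -o data=journal /dev/sdb1 /mnt/test   # device and mount point are examples
dd if=/dev/zero of=/mnt/test/file bs=1M count=16
filefrag /mnt/test/file                             # FIBMAP -> ext3_bmap() -> journal_flush()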

Bye
Honza


--
Jan Kara <[email protected]>
SuSE CR Labs

2006-02-14 20:20:16

by Mark Fasheh

Subject: Re: OCFS2 Filesystem inconsistency across nodes

On Tue, Feb 14, 2006 at 06:16:11AM +0000, Claudio Martins wrote:
> This patch does indeed seem to fix this particular problem. Now creating and
> deleting files/directories gives expected results across nodes.
Great!

> The bad news is that it didn't last long. While doing some more tests I found
> another problem, but judging from kernel messages I think this one is related
> to the DLM code.
Not so great :/

We have some dlm fixes in our git tree which haven't made their way to Linus
yet (I wanted to run a few more tests). Would you be interested in patching
with them so we can see which bugs are left? The easiest way to get this is
to pull them out of 2.6.16-rc3-mm1:

http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc3/2.6.16-rc3-mm1/broken-out/git-ocfs2.patch

You'll also pull a performance fix in that patch. Of course, even easier
might be to just run 2.6.16-rc3-mm1 for now as it also contains the
checkpoint list split backout.
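
E.g. something like this against your tree (--dry-run first, since your base
is rc2-git3 rather than rc3, so there may be fuzz or rejects):

wget http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc3/2.6.16-rc3-mm1/broken-out/git-ocfs2.patch
cd linux-2.6.16-rc2-git3
patch -p1 --dry-run < ../git-ocfs2.patch
patch -p1 < ../git-ocfs2.patch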

> On node 0, tar exited with:
> tar:
> build-AMD-linux-2.6.16-rc2-git3-jbdfix1/drivers/media/video/cx25840/cx25840-core.c:
> Cannot stat: Invalid argument
>
> On node 2, tar exited with a segmentation fault.
Could you load up debugfs.ocfs2 against your file system and run the
following command:

debugfs: locate <M0000000000000000d4a94b05ae097f>

It will tell me the path to the file which that metadata lock refers to.
The path may help us figure out what sort of access we're having problems on
here.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]

2006-02-15 03:18:06

by Claudio Martins

Subject: Re: OCFS2 Filesystem inconsistency across nodes


On Tuesday 14 February 2006 20:19, Mark Fasheh wrote:
>
> We have some dlm fixes in our git tree which haven't made their way to
> Linus yet (I wanted to run a few more tests). Would you be interested in
> patching with them so we can see which bugs are left? The easiest way to
> get this is to pull them out of 2.6.16-rc3-mm1:
>
> http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc3/2.6.16-rc3-mm1/broken-out/git-ocfs2.patch
>

OK, will do.

>
> Could you load up debugfs.ocfs2 against your file system and run the
> following command:
>
> debugfs: locate <M0000000000000000d4a94b05ae097f>
>
> It will tell me the path to the file which that metadata lock refers to.
> The path may help us figure out what sort of access we're having problems
> on here.

debugfs.ocfs2 1.1.5
debugfs: locate <M0000000000000000d4a94b05ae097f>
13936971 /ctpm/dir2/build-AMD-linux-2.6.16-rc2-git3-jbdfix1/drivers/media/video/cx25840/cx25840-core.c


This is a file from the kernel build tree I untarred on node 1. So I suppose
the other two nodes eventually were trying to read it.


Regards

Claudio

2006-02-15 17:50:57

by Nohez

Subject: Re: OCFS2 Filesystem inconsistency across nodes


Applied both patches - jbd-undo.patch and the dlm patch from the -mm1
kernel - to the recently released OpenSuSE kernel
(kernel-smp-2.6.16_rc3-2). Recreated the filesystem using mkfs.ocfs2.
Tried the two tests that I had reported in my previous email. Both of
them run OK now.

Now I have started bonnie++ on one node while the same volume is mounted,
but quiescent, on the other node. It's been more than 5 hours now
and both nodes are up & running. Seeing some very high load of >8 at
times on the node running bonnie++. The test node has 4GB of RAM and bonnie++
is creating 8GB of files to test IO performance. Will start bonnie++
on both nodes concurrently later.
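
(The invocation is along these lines -- flags from memory, with the file size
set to twice RAM so the page cache can't mask the IO:)

bonnie++ -d /raid5 -s 8192 -u root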

2006-02-15 21:39:59

by Cláudio Martins

Subject: Re: OCFS2 Filesystem inconsistency across nodes


On Wednesday 15 February 2006 17:50, Nohez wrote:
>
> Now started bonnie++ on one node while the same volume on the other node
> is mounted and is in quiescent stage. Its been more than 5 hours now
> and both the nodes are up & running. Seeing some very high load of >8 at
> times on the node running bonnie++. Test node has 4GB RAM and bonnie++
> is creating files of 8GB to test IO performance. Will start bonnie++
> on both the nodes concurrently later.
>

Nice. It'll be interesting when you start the concurrent tests. I think the
last fixes from -mm made it better, but I'm still getting DLM errors and
processes hung in D state when using tar to concurrently read kernel trees
on 3 nodes. I'd like to know whether you get the same or not.
I'll be sending kernel messages and a more complete report later this
evening.

Regards

Claudio