2023-08-09 22:37:07

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 07/13] xfs: document the invalidate_bdev call in invalidate_bdev

Copy and paste the commit message from Darrick into a comment to explain
the seemly odd invalidate_bdev in xfs_shutdown_devices.

Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/xfs/xfs_super.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 4ae3b01ed038c7..c169beb0d8cab3 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -399,6 +399,32 @@ STATIC void
xfs_shutdown_devices(
struct xfs_mount *mp)
{
+ /*
+ * Udev is triggered whenever anyone closes a block device or unmounts
+ * a file systemm on a block device.
+ * The default udev rules invoke blkid to read the fs super and create
+ * symlinks to the bdev under /dev/disk. For this, it uses buffered
+ * reads through the page cache.
+ *
+ * xfs_db also uses buffered reads to examine metadata. There is no
+ * coordination between xfs_db and udev, which means that they can run
+ * concurrently. Note there is no coordination between the kernel and
+ * blkid either.
+ *
+ * On a system with 64k pages, the page cache can cache the superblock
+ * and the root inode (and hence the root directory) with the same 64k
+ * page. If udev spawns blkid after the mkfs and the system is busy
+ * enough that it is still running when xfs_db starts up, they'll both
+ * read from the same page in the pagecache.
+ *
+ * The unmount writes updated inode metadata to disk directly. The XFS
+ * buffer cache does not use the bdev pagecache, nor does it invalidate
+ * the pagecache on umount. If the above scenario occurs, the pagecache
+ * no longer reflects what's on disk, xfs_db reads the stale metadata,
+ * and fails to find /a. Most of the time this succeeds because closing
+ * a bdev invalidates the page cache, but when processes race, everyone
+ * loses.
+ */
if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp) {
blkdev_issue_flush(mp->m_logdev_targp->bt_bdev);
invalidate_bdev(mp->m_logdev_targp->bt_bdev);
--
2.39.2



2023-08-09 23:12:11

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH 07/13] xfs: document the invalidate_bdev call in invalidate_bdev

On Wed, Aug 09, 2023 at 03:05:39PM -0700, Christoph Hellwig wrote:
> Copy and paste the commit message from Darrick into a comment to explain
> the seemly odd invalidate_bdev in xfs_shutdown_devices.

^ seemingly?
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> fs/xfs/xfs_super.c | 26 ++++++++++++++++++++++++++
> 1 file changed, 26 insertions(+)
>
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 4ae3b01ed038c7..c169beb0d8cab3 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -399,6 +399,32 @@ STATIC void
> xfs_shutdown_devices(
> struct xfs_mount *mp)
> {
> + /*
> + * Udev is triggered whenever anyone closes a block device or unmounts
> + * a file systemm on a block device.
> + * The default udev rules invoke blkid to read the fs super and create
> + * symlinks to the bdev under /dev/disk. For this, it uses buffered
> + * reads through the page cache.
> + *
> + * xfs_db also uses buffered reads to examine metadata. There is no
> + * coordination between xfs_db and udev, which means that they can run
> + * concurrently. Note there is no coordination between the kernel and
> + * blkid either.
> + *
> + * On a system with 64k pages, the page cache can cache the superblock
> + * and the root inode (and hence the root directory) with the same 64k
> + * page. If udev spawns blkid after the mkfs and the system is busy
> + * enough that it is still running when xfs_db starts up, they'll both
> + * read from the same page in the pagecache.
> + *
> + * The unmount writes updated inode metadata to disk directly. The XFS
> + * buffer cache does not use the bdev pagecache, nor does it invalidate
> + * the pagecache on umount. If the above scenario occurs, the pagecache

This sentence reads a little strangely, since "nor does it invalidate"
would seem to conflict with the invalidate_bdev call below. I suggest
changing the verb a bit:

"The XFS buffer cache does not use the bdev pagecache, so it needs to
invalidate that pagecache on unmount."

With those two things changed,
Reviewed-by: Darrick J. Wong <[email protected]>

--D


> + * no longer reflects what's on disk, xfs_db reads the stale metadata,
> + * and fails to find /a. Most of the time this succeeds because closing
> + * a bdev invalidates the page cache, but when processes race, everyone
> + * loses.
> + */
> if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp) {
> blkdev_issue_flush(mp->m_logdev_targp->bt_bdev);
> invalidate_bdev(mp->m_logdev_targp->bt_bdev);
> --
> 2.39.2
>

2023-08-10 08:51:30

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 07/13] xfs: document the invalidate_bdev call in invalidate_bdev

On Wed, Aug 09, 2023 at 03:39:23PM -0700, Darrick J. Wong wrote:
> On Wed, Aug 09, 2023 at 03:05:39PM -0700, Christoph Hellwig wrote:
> > Copy and paste the commit message from Darrick into a comment to explain
> > the seemly odd invalidate_bdev in xfs_shutdown_devices.
>
> ^ seemingly?
> >
> > Signed-off-by: Christoph Hellwig <[email protected]>
> > ---
> > fs/xfs/xfs_super.c | 26 ++++++++++++++++++++++++++
> > 1 file changed, 26 insertions(+)
> >
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 4ae3b01ed038c7..c169beb0d8cab3 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -399,6 +399,32 @@ STATIC void
> > xfs_shutdown_devices(
> > struct xfs_mount *mp)
> > {
> > + /*
> > + * Udev is triggered whenever anyone closes a block device or unmounts
> > + * a file systemm on a block device.
> > + * The default udev rules invoke blkid to read the fs super and create
> > + * symlinks to the bdev under /dev/disk. For this, it uses buffered
> > + * reads through the page cache.
> > + *
> > + * xfs_db also uses buffered reads to examine metadata. There is no
> > + * coordination between xfs_db and udev, which means that they can run
> > + * concurrently. Note there is no coordination between the kernel and
> > + * blkid either.
> > + *
> > + * On a system with 64k pages, the page cache can cache the superblock
> > + * and the root inode (and hence the root directory) with the same 64k
> > + * page. If udev spawns blkid after the mkfs and the system is busy
> > + * enough that it is still running when xfs_db starts up, they'll both
> > + * read from the same page in the pagecache.
> > + *
> > + * The unmount writes updated inode metadata to disk directly. The XFS
> > + * buffer cache does not use the bdev pagecache, nor does it invalidate
> > + * the pagecache on umount. If the above scenario occurs, the pagecache
>
> This sentence reads a little strangely, since "nor does it invalidate"
> would seem to conflict with the invalidate_bdev call below. I suggest
> changing the verb a bit:
>
> "The XFS buffer cache does not use the bdev pagecache, so it needs to
> invalidate that pagecache on unmount."
>
> With those two things changed,
> Reviewed-by: Darrick J. Wong <[email protected]>

Fixed in-tree.

2023-08-10 15:49:55

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 07/13] xfs: document the invalidate_bdev call in invalidate_bdev

On Wed, Aug 09, 2023 at 03:05:39PM -0700, Christoph Hellwig wrote:
> + /*
> + * Udev is triggered whenever anyone closes a block device or unmounts
> + * a file systemm on a block device.
> + * The default udev rules invoke blkid to read the fs super and create
> + * symlinks to the bdev under /dev/disk. For this, it uses buffered
> + * reads through the page cache.
> + *
> + * xfs_db also uses buffered reads to examine metadata. There is no
> + * coordination between xfs_db and udev, which means that they can run
> + * concurrently. Note there is no coordination between the kernel and
> + * blkid either.
> + *
> + * On a system with 64k pages, the page cache can cache the superblock
> + * and the root inode (and hence the root directory) with the same 64k
> + * page. If udev spawns blkid after the mkfs and the system is busy
> + * enough that it is still running when xfs_db starts up, they'll both
> + * read from the same page in the pagecache.
> + *
> + * The unmount writes updated inode metadata to disk directly. The XFS
> + * buffer cache does not use the bdev pagecache, nor does it invalidate
> + * the pagecache on umount. If the above scenario occurs, the pagecache
> + * no longer reflects what's on disk, xfs_db reads the stale metadata,
> + * and fails to find /a. Most of the time this succeeds because closing
> + * a bdev invalidates the page cache, but when processes race, everyone
> + * loses.
> + */
> if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp) {
> blkdev_issue_flush(mp->m_logdev_targp->bt_bdev);
> invalidate_bdev(mp->m_logdev_targp->bt_bdev);

While I have no complaints with this as a commit message, it's just too
verbose for an inline comment, IMO. Something pithier and more generic
would seem appropriate. How about:

/*
* Prevent userspace (eg blkid or xfs_db) from seeing stale data.
* XFS is not coherent with the bdev's page cache.
*/

2023-08-10 16:01:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 07/13] xfs: document the invalidate_bdev call in invalidate_bdev

On Thu, Aug 10, 2023 at 04:22:15PM +0100, Matthew Wilcox wrote:
> > blkdev_issue_flush(mp->m_logdev_targp->bt_bdev);
> > invalidate_bdev(mp->m_logdev_targp->bt_bdev);
>
> While I have no complaints with this as a commit message, it's just too
> verbose for an inline comment, IMO. Something pithier and more generic
> would seem appropriate. How about:
>
> /*
> * Prevent userspace (eg blkid or xfs_db) from seeing stale data.
> * XFS is not coherent with the bdev's page cache.
> */

Well, this completely misses the point. The point is that XFS should
never have to invalidate the page cache because it's not using it,
but it has to due to weird races. I tried to condese the message but
I could not come up with a good one that's not losing information.

2023-08-10 16:02:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 07/13] xfs: document the invalidate_bdev call in invalidate_bdev

On Wed, Aug 09, 2023 at 03:39:23PM -0700, Darrick J. Wong wrote:
> > + * read from the same page in the pagecache.
> > + *
> > + * The unmount writes updated inode metadata to disk directly. The XFS
> > + * buffer cache does not use the bdev pagecache, nor does it invalidate
> > + * the pagecache on umount. If the above scenario occurs, the pagecache
>
> This sentence reads a little strangely, since "nor does it invalidate"
> would seem to conflict with the invalidate_bdev call below. I suggest
> changing the verb a bit:
>
> "The XFS buffer cache does not use the bdev pagecache, so it needs to
> invalidate that pagecache on unmount."

Agreed. I'll forward it to the original author of the sentence time
permitting :)


2023-08-10 16:04:31

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH 07/13] xfs: document the invalidate_bdev call in invalidate_bdev

On Thu, Aug 10, 2023 at 04:22:15PM +0100, Matthew Wilcox wrote:
> On Wed, Aug 09, 2023 at 03:05:39PM -0700, Christoph Hellwig wrote:
> > + /*
> > + * Udev is triggered whenever anyone closes a block device or unmounts
> > + * a file systemm on a block device.
> > + * The default udev rules invoke blkid to read the fs super and create
> > + * symlinks to the bdev under /dev/disk. For this, it uses buffered
> > + * reads through the page cache.
> > + *
> > + * xfs_db also uses buffered reads to examine metadata. There is no
> > + * coordination between xfs_db and udev, which means that they can run
> > + * concurrently. Note there is no coordination between the kernel and
> > + * blkid either.
> > + *
> > + * On a system with 64k pages, the page cache can cache the superblock
> > + * and the root inode (and hence the root directory) with the same 64k
> > + * page. If udev spawns blkid after the mkfs and the system is busy
> > + * enough that it is still running when xfs_db starts up, they'll both
> > + * read from the same page in the pagecache.
> > + *
> > + * The unmount writes updated inode metadata to disk directly. The XFS
> > + * buffer cache does not use the bdev pagecache, nor does it invalidate
> > + * the pagecache on umount. If the above scenario occurs, the pagecache
> > + * no longer reflects what's on disk, xfs_db reads the stale metadata,
> > + * and fails to find /a. Most of the time this succeeds because closing
> > + * a bdev invalidates the page cache, but when processes race, everyone
> > + * loses.
> > + */
> > if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp) {
> > blkdev_issue_flush(mp->m_logdev_targp->bt_bdev);
> > invalidate_bdev(mp->m_logdev_targp->bt_bdev);
>
> While I have no complaints with this as a commit message, it's just too
> verbose for an inline comment, IMO. Something pithier and more generic
> would seem appropriate. How about:
>
> /*
> * Prevent userspace (eg blkid or xfs_db) from seeing stale data.
> * XFS is not coherent with the bdev's page cache.

"XFS' buffer cache is not coherent with the bdev's page cache."
?

--D

> */

2023-08-10 16:05:01

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH 07/13] xfs: document the invalidate_bdev call in invalidate_bdev

On Thu, Aug 10, 2023 at 05:52:25PM +0200, Christoph Hellwig wrote:
> On Thu, Aug 10, 2023 at 04:22:15PM +0100, Matthew Wilcox wrote:
> > > blkdev_issue_flush(mp->m_logdev_targp->bt_bdev);
> > > invalidate_bdev(mp->m_logdev_targp->bt_bdev);
> >
> > While I have no complaints with this as a commit message, it's just too
> > verbose for an inline comment, IMO. Something pithier and more generic
> > would seem appropriate. How about:
> >
> > /*
> > * Prevent userspace (eg blkid or xfs_db) from seeing stale data.
> > * XFS is not coherent with the bdev's page cache.
> > */
>
> Well, this completely misses the point. The point is that XFS should
> never have to invalidate the page cache because it's not using it,
> but it has to due to weird races. I tried to condese the message but
> I could not come up with a good one that's not losing information.

Agreed -- it took me a while to set up an arm64 box with just the right
debugging info to figure out why certain fstests were flaky. I do think
it's useful (despite my other reply to willy) to retain the defect
details for hard-to-reproduce errors, and the only way to do that
without encountering the dead url problem is to dump it in a huge
commit message or a comment.

(Too bad there's no way to have a commit whose code comments reference
the commit id of that commit to say "Hey, you need to read this commit
before you touch this line"...)

--D