2002-11-11 18:17:41

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH] flush_cache_page while pte valid

On some architectures (cachetlb.txt gives HyperSparc as an example)
it is essential to flush_cache_page while pte is still valid: the
rmap VM diverged from the base 2.4 VM before that fix was made,
so this error has crept back into 2.5.

Patch below applies to 2.5.47 or 2.5.47-mm1 - needs more work over
shared pagetables, but they've silently fallen out of 2.5.47-mm1:
oversight? I'll send Alan the equivalent for 2.4-ac shortly.

(I wonder, what happens if userspace now modifies the page
after the flush_cache_page, before the pte is invalidated?)

--- 2.5.47/mm/rmap.c Mon Oct 7 20:37:50 2002
+++ linux/mm/rmap.c Mon Nov 11 17:01:27 2002
@@ -393,9 +393,9 @@
}

/* Nuke the page table entry. */
+ flush_cache_page(vma, address);
pte = ptep_get_and_clear(ptep);
flush_tlb_page(vma, address);
- flush_cache_page(vma, address);

/* Store the swap location in the pte. See handle_pte_fault() ... */
if (PageSwapCache(page)) {


2002-11-11 21:13:52

by Manfred Spraul

[permalink] [raw]
Subject: Re: [PATCH] flush_cache_page while pte valid

> /* Nuke the page table entry. */
>+ flush_cache_page(vma, address);
> pte = ptep_get_and_clear(ptep);
> flush_tlb_page(vma, address);
>- flush_cache_page(vma, address);
>

Is it correct that this are 3 arch hooks that must appear back to back?
What about one hook with all parameters?

pte = ptep_get_and_clear_and_flush(ptep, vma, address);

The current implementation just asks for such errors.

Documentation:
ptep_get_and_clear_and_flush() removes the page table entry *ptep from the page tables and clears all caches (tlb, cpu cache if virtually tagged).
The return value contains the final value of the pte. The critical information is the dirty bit in the pte - on SMP one cpu can call ptep_get_and_clear_and_flush() while another cpu accesses the mmap'ing from user space.


--
Manfred


2002-11-11 23:14:43

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] flush_cache_page while pte valid

From: Hugh Dickins <[email protected]>
Date: Mon, 11 Nov 2002 18:25:25 +0000 (GMT)

On some architectures (cachetlb.txt gives HyperSparc as an example)
it is essential to flush_cache_page while pte is still valid: the
rmap VM diverged from the base 2.4 VM before that fix was made,
so this error has crept back into 2.5.
...
(I wonder, what happens if userspace now modifies the page
after the flush_cache_page, before the pte is invalidated?)

Thanks for catching this.

On architectures that are affected (such as the mentioned HyperSPARC
chips), the cpu will take a trap and OOPS the kernel if the PTE is
invalidated before the cache flush is made.

2002-11-11 23:33:24

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] [RFC] increase MAX_ADDR_LEN


So how are apps able to specify such larger hw addresses to configure
a driver if IFHWADDRLEN is still 6?

I'm not going to increase MAX_ADDR_LEN if there is no user ABI capable
of configuring such larger addresses properly.

2002-11-11 23:27:54

by Roland Dreier

[permalink] [raw]
Subject: [PATCH] [RFC] increase MAX_ADDR_LEN

Hi Dave,

I posted this to lkml last week and didn't get any response, positive
or negative, so I'm sending this directly to you. Would you include
these patches (or something that solves the same problem) in 2.5?
This will make it possible to add IP-over-InfiniBand when the driver
is ready without changes to the core kernel. Since the driver will
not be ready until 2.6 time, I would very much like to get all the
core changes required into the kernel now, before it's too late.

I'm very open to a suggestion of a better way to handle the fact that
IPoIB has a 20 byte hardware address, and I'd be happy to implement a
patch for an alternative approach if that's what it will take to get
support into the kernel.

Thanks,
Roland <[email protected]>

Here's my previous email:

Included below are some changes that I would like to see included (in
some form) in the standard kernel. They will make future support for
InfiniBand much simpler; while the code at <http://infiniband.sf.net>
is nowhere near ready for widespread use (let alone inclusion in the
kernel), if we can get these small changes in the kernel then it will
be much simpler to distribute and use InfiniBand drivers in the future.

Note that I am definitely not asking that these patches be included
as-is in the kernel. I would just like to open discussion by raising
an issue and proposing one possible solution.

These patches (against Linus's BK tree of a few days ago):

1. Add ARPHRD_INFINIBAND to if_arp.h. This is extremely minor, but
does make an IP-over-InfiniBand driver cleaner. My feeling is we
might as well do this.

2. Increase MAX_ADDR_LEN to 32 (from 8) and make some small changes
to prevent this from causing problems for existing code.

3. Get rid of suspicious-looking uses of MAX_ADDR_LEN in the sungem
and s390/net/lcs drivers. I have no idea whether these changes
are necessary but I wanted to call this code to the maintainer's
attention.

I expect #2 to be somewhat controversial, so let me give my motivation
for proposing this. The IETF draft for IP-over-InfiniBand (which
seems very unlikely to change at this point) defines the IPoIB
hardware address to be 20 bytes long. Of course, not everyone feels
this was the best way to define the encapsulation, but the issue was
extensively debated in the IETF ipoib working group, and the 20 byte
hardware address seems to be the least bad option.

(By the way, I have no particular attachment to the exact number 32.
As long as we raise MAX_ADDR_LEN to at least 20, I'll be happy. It
does seem desirable for MAX_ADDR_LEN to be a multiple of 8. With this
in mind, 24 would be perfectly acceptable for IPoIB; I just chose 32
becase I had to choose something!)

There are two ways to implement an IPoIB network driver:

1. The driver can tell the kernel that the hardware address length
is less than it really is, and rewrite packets as they pass into
and out of the driver. I've actually implemented this, and while
it works, it is ugly, complex, and implies that special tools are
required in userspace (for example, creating an ARP entry
requires a special mechanism for passing the hardware address
directly to the driver, and then creating a corresponding entry
in the kernel's ARP table).

2. The driver can give the real 20-byte hardware address length to
the kernel. This seems much cleaner, but of course it requires
the value of MAX_ADDR_LEN to be increased.

Since the second option seems so much better, I would like to get the
needed changes into the kernel before the 2.6 release. If this change
is not in the standard kernel, then vendor kernels will likely not
pick it up. This will substantially complicate the task of developing
an IPoIB driver, since anyone who wants to test or use the driver will
have to patch their kernel. If these changes go into the standard
kernel, then an IPoIB driver can simply be distributed as a module
that is compiled and loaded into any 2.6 kernel.

Please try these patches and let me know if they cause any problems
for you. I am currently running these patches and a small dummy net
driver that registers a device with type ARPHRD_INFINIBAND and
hardware address length 20 on my workstation without trouble.
ifconfig does produce bogus output for the dummy network device but
that is another battle. (I would be interested in ideas for how to
resolve the fact that the sa_data member of struct sockaddr is only 14
bytes long)

Of course I am also eager to find out other developers' thoughts on
how to implement IPoIB on Linux.

Thanks,
Roland <[email protected]>

===================================================================

You can import this changeset into BK by piping this whole message to:
'| bk receive [path to repository]' or apply the patch as usual.

===================================================================


[email protected], 2002-11-07 11:37:37-08:00, [email protected]
Increase MAX_ADDR_LEN from 8 to 32 (to support
IP-over-InfiniBand hardware addresses)

[email protected], 2002-11-07 11:35:31-08:00, [email protected]
Add IANA-assigned ARPHRD_INFINIBAND value for IP-over-InfiniBand to if_arp.h


drivers/net/sungem.c | 2 +-
drivers/s390/net/lcs.c | 2 +-
include/linux/if_arp.h | 1 +
include/linux/netdevice.h | 2 +-
net/core/dev.c | 4 ++--
5 files changed, 6 insertions(+), 5 deletions(-)


diff -Nru a/drivers/net/sungem.c b/drivers/net/sungem.c
--- a/drivers/net/sungem.c Thu Nov 7 12:47:32 2002
+++ b/drivers/net/sungem.c Thu Nov 7 12:47:32 2002
@@ -2858,7 +2858,7 @@
printk(KERN_ERR "%s: can't get mac-address\n", dev->name);
return -1;
}
- memcpy(dev->dev_addr, addr, MAX_ADDR_LEN);
+ memcpy(dev->dev_addr, addr, 6);
#else
get_gem_mac_nonobp(gp->pdev, gp->dev->dev_addr);
#endif
diff -Nru a/drivers/s390/net/lcs.c b/drivers/s390/net/lcs.c
--- a/drivers/s390/net/lcs.c Thu Nov 7 12:47:32 2002
+++ b/drivers/s390/net/lcs.c Thu Nov 7 12:47:32 2002
@@ -3569,7 +3569,7 @@
struct ifmcaddr6 *im6;
struct inet6_dev *in6_dev;
#endif
- char buf[MAX_ADDR_LEN];
+ char buf[LCS_ADDR_LEN];
lcs_ipm_list *curr_lmem, *new_lmem;
lcs_state oldstate;

diff -Nru a/include/linux/if_arp.h b/include/linux/if_arp.h
--- a/include/linux/if_arp.h Thu Nov 7 12:47:32 2002
+++ b/include/linux/if_arp.h Thu Nov 7 12:47:32 2002
@@ -40,6 +40,7 @@
#define ARPHRD_METRICOM 23 /* Metricom STRIP (new IANA id) */
#define ARPHRD_IEEE1394 24 /* IEEE 1394 IPv4 - RFC 2734 */
#define ARPHRD_EUI64 27 /* EUI-64 */
+#define ARPHRD_INFINIBAND 32 /* InfiniBand */

/* Dummy types for non ARP hardware */
#define ARPHRD_SLIP 256
diff -Nru a/include/linux/netdevice.h b/include/linux/netdevice.h
--- a/include/linux/netdevice.h Thu Nov 7 12:47:32 2002
+++ b/include/linux/netdevice.h Thu Nov 7 12:47:32 2002
@@ -65,7 +65,7 @@

#endif

-#define MAX_ADDR_LEN 8 /* Largest hardware address length */
+#define MAX_ADDR_LEN 32 /* Largest hardware address length */

/*
* Compute the worst case header length according to the protocols
diff -Nru a/net/core/dev.c b/net/core/dev.c
--- a/net/core/dev.c Thu Nov 7 12:47:32 2002
+++ b/net/core/dev.c Thu Nov 7 12:47:32 2002
@@ -2062,7 +2062,7 @@

case SIOCGIFHWADDR:
memcpy(ifr->ifr_hwaddr.sa_data, dev->dev_addr,
- MAX_ADDR_LEN);
+ min(sizeof ifr->ifr_hwaddr.sa_data, dev->addr_len));
ifr->ifr_hwaddr.sa_family = dev->type;
return 0;

@@ -2083,7 +2083,7 @@
if (ifr->ifr_hwaddr.sa_family != dev->type)
return -EINVAL;
memcpy(dev->broadcast, ifr->ifr_hwaddr.sa_data,
- MAX_ADDR_LEN);
+ min(sizeof ifr->ifr_hwaddr.sa_data, dev->addr_len));
notifier_call_chain(&netdev_chain,
NETDEV_CHANGEADDR, dev);
return 0;

2002-11-11 23:47:09

by Alan

[permalink] [raw]
Subject: Re: [PATCH] [RFC] increase MAX_ADDR_LEN

On Mon, 2002-11-11 at 23:38, David S. Miller wrote:
>
> So how are apps able to specify such larger hw addresses to configure
> a driver if IFHWADDRLEN is still 6?
>
> I'm not going to increase MAX_ADDR_LEN if there is no user ABI capable
> of configuring such larger addresses properly.

The kernel just ignores it. We support multiple devices with larger
address lengths. Its mostly a legacy constant

2002-11-11 23:52:14

by Roland Dreier

[permalink] [raw]
Subject: Re: [PATCH] [RFC] increase MAX_ADDR_LEN

>>>>> "David" == David S Miller <[email protected]> writes:

David> So how are apps able to specify such larger hw addresses to
David> configure a driver if IFHWADDRLEN is still 6?

In the InfiniBand case, the device's hardware address comes from a
combination of the port GID (which is set by the InfiniBand subnet
manager through an IB-specific mechanism) and the queue pair number
that the driver gets when it initializes. There definitely still are
problems to solve, such as specifying static ARP entries.

David> I'm not going to increase MAX_ADDR_LEN if there is no user
David> ABI capable of configuring such larger addresses properly.

What would you consider a palatable ABI? (I'm happy to implement it)
Enlarging sa_data in struct sockaddr doesn't seem feasible. I guess
we could add a new socket ioctl() or extend SIOCGIFHWADDR/SIOCSIFHWADDR
somehow....

Thanks,
Roland <[email protected]>

2002-11-11 23:54:56

by Roland Dreier

[permalink] [raw]
Subject: Re: [PATCH] [RFC] increase MAX_ADDR_LEN

>>>>> "Alan" == Alan Cox <[email protected]> writes:

>>>>> On Mon, 2002-11-11 at 23:38, David S. Miller wrote:

Dave> So how are apps able to specify such larger hw addresses to
Dave> configure a driver if IFHWADDRLEN is still 6?

Dave> I'm not going to increase MAX_ADDR_LEN if there is no user
Dave> ABI capable of configuring such larger addresses properly.

Alan> The kernel just ignores it. We support multiple devices with
Alan> larger address lengths. Its mostly a legacy constant

What drivers in the kernel are there with address length more than
MAX_ADDR_LEN? What do they put dev->addr_len and dev->dev_addr?

Thanks,
Roland <[email protected]>

2002-11-12 00:31:04

by Russell King

[permalink] [raw]
Subject: Re: [PATCH] flush_cache_page while pte valid

On Mon, Nov 11, 2002 at 10:20:30PM +0100, Manfred Spraul wrote:
> > /* Nuke the page table entry. */
> >+ flush_cache_page(vma, address);
> > pte = ptep_get_and_clear(ptep);
> > flush_tlb_page(vma, address);
> >- flush_cache_page(vma, address);
> >
>
> Is it correct that this are 3 arch hooks that must appear back to back?
> What about one hook with all parameters?
>
> pte = ptep_get_and_clear_and_flush(ptep, vma, address);
>
> The current implementation just asks for such errors.

Its actually a very simple rule. The sequence must be:

- flush cache for area
- change pte entries in area
- flush tlb for area

Anything else is just buggy, and may very well be racy. Think about the
race when you flush the tlb entry before changing the pte.

Rather than creating a new interface that's only useful for 10% of the
cases, I'd prefer to keep the rule personally. The smaller the number
of functions each with their own particular set of behaviours doing
almost the same job, the less chance of getting the wrong function.
And, IMHO, the easier it is to audit the code.

grep -4 ptep_get_and_clear mm/*.c

vs

"Is this the right function here?"

PS, I see one place where "ptep_get_and_clear_and_flush" would be useful
out of 6 uses.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2002-11-12 01:02:04

by Ulrich Weigand

[permalink] [raw]
Subject: Re: [PATCH] flush_cache_page while pte valid

Manfred Spraul wrote:

>Is it correct that this are 3 arch hooks that must appear back to back?
>What about one hook with all parameters?
>
> pte = ptep_get_and_clear_and_flush(ptep, vma, address);

This interface would help us on s390 very much. We have a hardware
instruction (INVALIDATE PAGE TABLE ENTRY) that implements the combination
of
pte = ptep_get_and_clear(ptep);
flush_tlb_page(vma, address);

very efficiently, but we cannot use it to implement flush_tlb_page
alone, because it requires the pte to be valid.

This is why our current in-tree flush_tlb_page is quite inefficient
(it always flushes the complete TLB) ...

If we had a combined hook, we could use our IPTE instruction.

Bye,
Ulrich

--
Dr. Ulrich Weigand
[email protected]

2002-11-12 06:48:27

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] flush_cache_page while pte valid

From: Hugh Dickins <[email protected]>
Date: Tue, 12 Nov 2002 06:53:04 +0000 (GMT)

Thanks for shedding light on that; but I'm still wondering if there
might be data loss if userspace modifies the page in the tiny window
between correctly positioned flush_cache_page and pte invalidation?

The flush merely writes back the data, a copy-back operation, fully L2
cache coherent. All cpus will see correct data if an intermittant
store occurs.

2002-11-12 06:45:23

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] flush_cache_page while pte valid

On Mon, 11 Nov 2002, David S. Miller wrote:
> From: Hugh Dickins <[email protected]>
> Date: Mon, 11 Nov 2002 18:25:25 +0000 (GMT)
>
> On some architectures (cachetlb.txt gives HyperSparc as an example)
> it is essential to flush_cache_page while pte is still valid: the
> rmap VM diverged from the base 2.4 VM before that fix was made,
> so this error has crept back into 2.5.
> ...
> (I wonder, what happens if userspace now modifies the page
> after the flush_cache_page, before the pte is invalidated?)
>
> Thanks for catching this.
>
> On architectures that are affected (such as the mentioned HyperSPARC
> chips), the cpu will take a trap and OOPS the kernel if the PTE is
> invalidated before the cache flush is made.

Thanks for shedding light on that; but I'm still wondering if there
might be data loss if userspace modifies the page in the tiny window
between correctly positioned flush_cache_page and pte invalidation?

Hugh

2002-11-12 13:52:02

by Alan

[permalink] [raw]
Subject: Re: [PATCH] [RFC] increase MAX_ADDR_LEN

On Tue, 2002-11-12 at 00:01, Roland Dreier wrote:
> What drivers in the kernel are there with address length more than
> MAX_ADDR_LEN? What do they put dev->addr_len and dev->dev_addr?

AX.25 addresses are 7 bytes long at the physical layer, but because they
including routing info up to about 60 bytes long elsewhere. Fortunately
lengths are passed to all but the hw level SIOCSIF/SIOCGIF calls, and
even those can be increased a little without harm as they use the
struct sockaddr - in fact I think 14 bytes would be the limit.

Seems trivial to do in 2.5 and quite doable in 2.4 if you actually have
to worry about this at the SIOCSIFADDR/GIFHWADDR level.

Alan

2002-11-12 14:13:48

by folkert

[permalink] [raw]
Subject: Re: [PATCH] [RFC] increase MAX_ADDR_LEN

Hi,

Am I right that in net/ipv4/tcp_ipv4.c in function "tcp_v4_get_port" the
portnumber for a new connection is generated? Because fiddling with that
code seems to have no effect on the portnumbers generated for new
connections.


Folkert


2002-11-12 14:43:21

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] flush_cache_page while pte valid

On Mon, 11 Nov 2002, David S. Miller wrote:
> From: Hugh Dickins <[email protected]>
> Date: Tue, 12 Nov 2002 06:53:04 +0000 (GMT)
>
> Thanks for shedding light on that; but I'm still wondering if there
> might be data loss if userspace modifies the page in the tiny window
> between correctly positioned flush_cache_page and pte invalidation?
>
> The flush merely writes back the data, a copy-back operation, fully L2
> cache coherent. All cpus will see correct data if an intermittant
> store occurs.

The CPUs will, but the harddisk might not.

We really need to get this right in the swapout path.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
Current spamtrap: <a href=mailto:"[email protected]">[email protected]</a>

2002-11-12 15:07:42

by Roland Dreier

[permalink] [raw]
Subject: Re: [PATCH] [RFC] increase MAX_ADDR_LEN

>>>>> "Alan" == Alan Cox <[email protected]> writes:

Alan> AX.25 addresses are 7 bytes long at the physical layer, but
Alan> because they including routing info up to about 60 bytes
Alan> long elsewhere. Fortunately lengths are passed to all but
Alan> the hw level SIOCSIF/SIOCGIF calls, and even those can be
Alan> increased a little without harm as they use the struct
Alan> sockaddr - in fact I think 14 bytes would be the limit.

Alan> Seems trivial to do in 2.5 and quite doable in 2.4 if you
Alan> actually have to worry about this at the
Alan> SIOCSIFADDR/GIFHWADDR level.

Hmm... the problem I would like to solve is that IP-over-InfiniBand
has 20 byte hardware addresses. One can implement a driver that lies
about its HW address len (you have an internal ARP cache and only
expose a hash value/cookie to the rest of the world). In fact I've
done just that for IPoIB.

It works but it's ugly and overly complex. I would like to find a
clean solution for 2.5/2.6, but it looks like it will require changes
to the net core. I'd like to do those now so the IPoIB driver can
just drop in cleanly later. (I want to be able to just add
drivers/infiniband during 2.6 without and invasive changes required)

To do that seems to require increasing MAX_ADDR_LEN -- otherwise I
don't see what the driver can put in addr_len/dev_addr.

Thanks,
Roland

2002-11-12 15:28:42

by Alan

[permalink] [raw]
Subject: Re: [PATCH] [RFC] increase MAX_ADDR_LEN

On Tue, 2002-11-12 at 15:14, Roland Dreier wrote:
> Hmm... the problem I would like to solve is that IP-over-InfiniBand
> has 20 byte hardware addresses. One can implement a driver that lies
> about its HW address len (you have an internal ARP cache and only
> expose a hash value/cookie to the rest of the world). In fact I've
> done just that for IPoIB.

Our ARP code handles AX.25 fine, and that can include long addresses, so
either we have a live bug or it works 8). The multicast layer does have
problems with addresses over 8 bytes (MAX_ADDR_LEN).

> It works but it's ugly and overly complex. I would like to find a
> clean solution for 2.5/2.6, but it looks like it will require changes
> to the net core. I'd like to do those now so the IPoIB driver can
> just drop in cleanly later. (I want to be able to just add
> drivers/infiniband during 2.6 without and invasive changes required)

I think you need to do two things.

1. Increase MAX_ADDR_LEN
2. Add some new address setting ioctls, and ensure the old ones keep the
old address length limit. That is needed because the old caller wont
have allocated enough address space for a 20 byte address return.

You have to solve both though, and the first patch should probably be
the one to add more sensible address set/get functions.

Subject: Re: [PATCH] [RFC] increase MAX_ADDR_LEN

In article <[email protected]> (at 12 Nov 2002 16:00:36 +0000), Alan Cox <[email protected]> says:

> 2. Add some new address setting ioctls, and ensure the old ones keep the
> old address length limit. That is needed because the old caller wont
> have allocated enough address space for a 20 byte address return.
>
> You have to solve both though, and the first patch should probably be
> the one to add more sensible address set/get functions.

*BSDs have SIOCGLIFPHYADDR etc., but, they're obsolete;
we should use rtnetlink (or routing socket in BSDs) to manage
addresses. So, not having such ioctls for long addresses
would be ok.

--
Hideaki YOSHIFUJI @ USAGI Project <[email protected]>
GPG FP: 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA

2002-11-12 17:35:53

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] flush_cache_page while pte valid

On Mon, 11 Nov 2002, David S. Miller wrote:
> From: Hugh Dickins <[email protected]>
> Date: Tue, 12 Nov 2002 06:53:04 +0000 (GMT)
>
> Thanks for shedding light on that; but I'm still wondering if there
> might be data loss if userspace modifies the page in the tiny window
> between correctly positioned flush_cache_page and pte invalidation?
>
> The flush merely writes back the data, a copy-back operation, fully L2
> cache coherent. All cpus will see correct data if an intermittant
> store occurs.

Sorry, I still don't get it. If the flush_cache_page is doing something
necessary, then won't a user access in between it and invalidating pte
undo what was necessary? And if it's not necessary, why do we do it?
(For better performance would be a very good reason.)

But don't worry about me: I may well have some fundamental misconception
which emailing back and forth will fail to resolve, would just waste your
time and expose my ignorance. (I think Andrew has sometimes protested
that "flush" can mean too many different things.) So I'm trying to
understand it better from over here - but glad to see Rik seems
to understand my concern.

Hugh

2002-11-12 20:29:22

by Roland Dreier

[permalink] [raw]
Subject: Re: [PATCH] [RFC] increase MAX_ADDR_LEN

>>>>> "Alan" == Alan Cox <[email protected]> writes:

Alan> 1. Increase MAX_ADDR_LEN 2. Add some new address setting
Alan> ioctls, and ensure the old ones keep the old address length
Alan> limit. That is needed because the old caller wont have
Alan> allocated enough address space for a 20 byte address return.

Thanks to YOSHIFUJI Hideaki / $B5HF#1QL@(B, I had a look at rtnetlink. It
seems like we would get the necessary address setting functionality if
I implemented the following:

1. Add an RTM_SETLINK message type that handles at least the
IFLA_ADDRESS attribute. This would replace SIOCSIFHWADDR for
interfaces with long hardware addresses.

2. Add code to handle receiving RTM_NEWNEIGH and RTM_DELNEIGH
messages from user space. This would replace SIOCSARP and
SIOCDARP for interfaces with long hardware addresses.

Dave, Alan, if I wrote a patch to do this would you accept it? (And
following that increase MAX_ADDR_LEN?)

(By the way the original patch I posted added code to the
SIOCSIFHWADDR/SIOCGIFHWADDR handler to prevent a long hardware address
from overrunning the ifr_data member that user space passed in)

Thanks,
Roland

2002-11-12 20:46:49

by Manfred Spraul

[permalink] [raw]
Subject: Re: [PATCH] flush_cache_page while pte valid

>
>
>>
>> The flush merely writes back the data, a copy-back operation, fully L2
>> cache coherent. All cpus will see correct data if an intermittant
>> store occurs.
>
>The CPUs will, but the harddisk might not.
>
>
The lost dirty bit can only happen on cpus where the hardware maintains
a dirty bit. I doubt that the sparc cpus do that in hardware.

But like Hugh I don't understand how the cache writeback works on SMP.

cpu1: cpu 2
access a mmaping, pte loaded into TLB
try_to_unmap_one()
flush_cache_page();
access the mmaping again. pte either still from
tlb, or reloaded from pte.
ptep_get_and_clear();
access the mmaping again, using the tlb
flush_tlb()
??? How will the cpu write back now?

If the write back happens based on the tlb, then I don't understand why
it's needed at all.

Regarding the dirty bit:
The assumption for the dirty bit is that the i386 cpu are the only cpus
in the world (TM) that maintain the dirty bit in hardware, and tests on
several i386 cpus have shown that the tlb walker retests the present bit
before setting the dirty bit. Software tlb implementations must emulate
that.

Thus it's guaranteed that
- if the dirty bit is not set in the result of ptep_get_and_clear, then
no write operation has happened or will happen.
- if the dirty bit is set, then write operations could happen until the
tlb flush.
- there will be no spuriously set dirty bits in the page tables.

--
Manfred

2002-11-12 21:25:17

by Ulrich Weigand

[permalink] [raw]
Subject: Re: [PATCH] flush_cache_page while pte valid

Manfred Spraul wrote:

>The assumption for the dirty bit is that the i386 cpu are the only cpus
>in the world (TM) that maintain the dirty bit in hardware

Not true; the s390 also has a dirty bit in hardware ;-)

To make things more interesting, the s390 dirty bit does not
reside in the pte, but in a storage key associated with the
*physical* page. The bit is set on any access to the physical
page, whether through any virtual mapping, by direct access
with dynamic address translation switched off, or by an I/O
device moving data directly to main memory.

Bye,
Ulrich

--
Dr. Ulrich Weigand
[email protected]

2002-11-12 21:40:54

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] flush_cache_page while pte valid

From: Rik van Riel <[email protected]>
Date: Tue, 12 Nov 2002 12:49:50 -0200 (BRST)

On Mon, 11 Nov 2002, David S. Miller wrote:
> The flush merely writes back the data, a copy-back operation, fully L2
> cache coherent. All cpus will see correct data if an intermittant
> store occurs.

The CPUs will, but the harddisk might not.

We really need to get this right in the swapout path.

We do get it right, watch:

1) remove last mapping instance of page

-> MM level cache flush pushes it permanently to ram
in full view of DMA activity

2) remove last mapping, but new one appears as we swap out

-> no problem we'll see one of several instances of the
page and we'll need to reswap it out later anyways
if someone currently writes to it

I know this because I tested this extensively ages ago on sparc32
where it really mattered.

2002-11-12 21:47:29

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] flush_cache_page while pte valid

From: Hugh Dickins <[email protected]>
Date: Tue, 12 Nov 2002 17:43:40 +0000 (GMT)

Sorry, I still don't get it. If the flush_cache_page is doing something
necessary, then won't a user access in between it and invalidating pte
undo what was necessary? And if it's not necessary, why do we do it?
(For better performance would be a very good reason.)

If there are other writable mappings of the page, we can't swap
it out legally.

2002-11-12 21:56:17

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] flush_cache_page while pte valid

From: Manfred Spraul <[email protected]>
Date: Tue, 12 Nov 2002 21:53:14 +0100

The lost dirty bit can only happen on cpus where the hardware maintains
a dirty bit. I doubt that the sparc cpus do that in hardware.

sun4m sparc32 chips do, exactly like x86

Franks a lot,
David S. Miller
[email protected]

2002-11-12 22:32:43

by Roland Dreier

[permalink] [raw]
Subject: Re: [PATCH] [RFC] increase MAX_ADDR_LEN

>>>>> "David" == David S Miller <[email protected]> writes:

Roland> if I wrote a patch to do this would you accept it? (And
Roland> following that increase MAX_ADDR_LEN?)

David> I would have to see it first, but likely yes.

Of course, that's what I would expect. (And I expect to go through a
few revisions before it's good enough) I just don't want to start on
something that you think is categorically a stupid idea.

I will code up the rtnetlink extensions I proposed and post them when
they are ready.

Thanks,
Roland

2002-11-12 22:26:39

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] [RFC] increase MAX_ADDR_LEN

From: Roland Dreier <[email protected]>
Date: 12 Nov 2002 12:36:10 -0800

Dave, Alan, if I wrote a patch to do this would you accept it? (And
following that increase MAX_ADDR_LEN?)

I would have to see it first, but likely yes.

2002-11-12 22:52:10

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] flush_cache_page while pte valid

On Tue, 12 Nov 2002, David S. Miller wrote:

> From: Hugh Dickins <[email protected]>
> Date: Tue, 12 Nov 2002 17:43:40 +0000 (GMT)
>
> Sorry, I still don't get it. If the flush_cache_page is doing something
> necessary, then won't a user access in between it and invalidating pte
> undo what was necessary? And if it's not necessary, why do we do it?
> (For better performance would be a very good reason.)
>
> If there are other writable mappings of the page, we can't swap
> it out legally.

But I'm worried about the case where this is the last writable mapping:
it seems userspace (on another CPU) can still write to it in between
the flush_cache_page and invalidation of the pte (on this CPU hoping
to swap out that page).

Hugh