2009-09-26 03:15:43

by Fengguang Wu

Subject: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

The swap cache and page cache code assume that they 'own' a newly
allocated page and can therefore disregard the locking rules. However,
hwpoison can now hit any page at any time.

So use the safer lock_page()/trylock_page(). The main intention is not
to close the small time window of memory corruption, but to avoid
kernel oopses that may result from such races, and also to avoid
raising false alerts in hwpoison stress tests.

In theory this will slightly increase page cache/swap cache overheads,
however the difference seems too small to be measurable in benchmarks.

CC: Hugh Dickins <[email protected]>
CC: Andi Kleen <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
include/linux/pagemap.h | 13 ++++---------
mm/migrate.c | 2 +-
mm/swap_state.c | 4 ++--
3 files changed, 7 insertions(+), 12 deletions(-)

--- sound-2.6.orig/mm/swap_state.c 2009-09-14 10:50:19.000000000 +0800
+++ sound-2.6/mm/swap_state.c 2009-09-25 18:42:23.000000000 +0800
@@ -306,7 +306,7 @@ struct page *read_swap_cache_async(swp_e
* re-using the just freed swap entry for an existing page.
* May fail (-ENOMEM) if radix-tree node allocation failed.
*/
- __set_page_locked(new_page);
+ lock_page(new_page);
SetPageSwapBacked(new_page);
err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
if (likely(!err)) {
@@ -318,7 +318,7 @@ struct page *read_swap_cache_async(swp_e
return new_page;
}
ClearPageSwapBacked(new_page);
- __clear_page_locked(new_page);
+ unlock_page(new_page);
swapcache_free(entry, NULL);
} while (err != -ENOMEM);

--- sound-2.6.orig/include/linux/pagemap.h 2009-09-14 10:50:19.000000000 +0800
+++ sound-2.6/include/linux/pagemap.h 2009-09-25 18:42:19.000000000 +0800
@@ -292,11 +292,6 @@ extern int __lock_page_killable(struct p
extern void __lock_page_nosync(struct page *page);
extern void unlock_page(struct page *page);

-static inline void __set_page_locked(struct page *page)
-{
- __set_bit(PG_locked, &page->flags);
-}
-
static inline void __clear_page_locked(struct page *page)
{
__clear_bit(PG_locked, &page->flags);
@@ -435,18 +430,18 @@ extern void remove_from_page_cache(struc
extern void __remove_from_page_cache(struct page *page);

/*
- * Like add_to_page_cache_locked, but used to add newly allocated pages:
- * the page is new, so we can just run __set_page_locked() against it.
+ * Like add_to_page_cache_locked, but used to add newly allocated pages.
*/
static inline int add_to_page_cache(struct page *page,
struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask)
{
int error;

- __set_page_locked(page);
+ if (!trylock_page(page))
+ return -EIO; /* hwpoisoned */
error = add_to_page_cache_locked(page, mapping, offset, gfp_mask);
if (unlikely(error))
- __clear_page_locked(page);
+ unlock_page(page);
return error;
}

--- sound-2.6.orig/mm/migrate.c 2009-09-14 10:50:19.000000000 +0800
+++ sound-2.6/mm/migrate.c 2009-09-25 18:42:19.000000000 +0800
@@ -551,7 +551,7 @@ static int move_to_new_page(struct page
* holding a reference to the new page at this point.
*/
if (!trylock_page(newpage))
- BUG();
+ return -EAGAIN; /* got by hwpoison */

/* Prepare mapping for the new page.*/
newpage->index = page->index;


2009-09-26 03:49:37

by Andi Kleen

Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Sat, Sep 26, 2009 at 11:15:37AM +0800, Wu Fengguang wrote:
> The swap cache and page cache code assume that they 'own' a newly
> allocated page and can therefore disregard the locking rules. However,
> hwpoison can now hit any page at any time.
>
> So use the safer lock_page()/trylock_page(). The main intention is not
> to close the small time window of memory corruption, but to avoid
> kernel oopses that may result from such races, and also to avoid
> raising false alerts in hwpoison stress tests.
>
> In theory this will slightly increase page cache/swap cache overheads,
> however the difference seems too small to be measurable in benchmarks.

Thanks. Can you please describe what benchmarks you used?

Acked-by: Andi Kleen <[email protected]>
-andi

2009-09-26 10:53:11

by Fengguang Wu

Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Sat, Sep 26, 2009 at 11:49:36AM +0800, Andi Kleen wrote:
> On Sat, Sep 26, 2009 at 11:15:37AM +0800, Wu Fengguang wrote:
> > The swap cache and page cache code assume that they 'own' a newly
> > allocated page and can therefore disregard the locking rules. However,
> > hwpoison can now hit any page at any time.
> >
> > So use the safer lock_page()/trylock_page(). The main intention is not
> > to close the small time window of memory corruption, but to avoid
> > kernel oopses that may result from such races, and also to avoid
> > raising false alerts in hwpoison stress tests.
> >
> > In theory this will slightly increase page cache/swap cache overheads,
> > however the difference seems too small to be measurable in benchmarks.
>
> Thanks. Can you please describe what benchmarks you used?

Good question. The test case is to copy a 1GB sparse file over NFS.

I ran it 100 times (2 hours total wall time) and the overhead is ~0.008%.

The average time reported by the function profiler:

          total time        add_to_page_cache_lru   percent
before    1940083424.361    4841664.805             0.242%
after     1914258447.188    4889044.435             0.250%

The script used is:

for i in `seq 1 100`
do
	# copy 1GB sparse file over NFS
	echo 1 > /debug/tracing/function_profile_enabled
	cp /tmp/sparse/1G /dev/null
	echo 0 > /debug/tracing/function_profile_enabled

	fadvise /tmp/sparse/1G 0 0 dontneed
	cat /debug/tracing/trace_stat/function0 > function0.$i
	cat /debug/tracing/trace_stat/function1 > function1.$i
done


File function0.1 after patch:

Function Hit Time Avg
-------- --- ---- ---
schedule 32492 474358128 us 14599.22 us
schedule_timeout 26 99994288 us 3845934 us
poll_schedule_timeout 8 52914376 us 6614297 us
sys_poll 6 52914373 us 8819062 us
schedule_hrtimeout_range 8 52914373 us 6614296 us
do_sys_poll 6 52914355 us 8819059 us
sys_read 86997 40703194 us 467.868 us
vfs_read 87002 40550651 us 466.088 us
do_sync_read 86987 40011054 us 459.965 us
nfs_file_read 86985 39945631 us 459.224 us
generic_file_aio_read 86985 39785858 us 457.387 us
page_cache_async_readahead 2886 34379069 us 11912.35 us
ondemand_readahead 2889 34371850 us 11897.49 us
ra_submit 2891 34368378 us 11888.05 us
__do_page_cache_readahead 2891 34360744 us 11885.41 us
nfs_readpages 2876 33198107 us 11543.15 us
read_cache_pages 2876 31617284 us 10993.49 us
__rpc_execute 84468 25404441 us 300.758 us
readpage_async_filler 174071 24947004 us 143.315 us
do_softirq 44822 24829675 us 553.961 us
__do_softirq 44822 24803408 us 553.375 us
net_rx_action 38424 24357207 us 633.906 us
e1000_clean 39042 24253141 us 621.206 us
nfs_pageio_doio 43534 22993433 us 528.171 us
nfs_pagein_one 43520 22939752 us 527.108 us
nfs_read_rpcsetup 43513 22533176 us 517.849 us
e1000_clean_rx_irq 39042 22047306 us 564.707 us
rpc_run_task 43528 22042766 us 506.404 us
call_transmit 45873 21842655 us 476.154 us
nfs_pageio_add_request 174091 21583666 us 123.979 us
rpc_execute 43528 21254402 us 488.292 us
cpuidle_idle_call 18073 19445493 us 1075.941 us
acpi_idle_enter_c1 17999 19396945 us 1077.667 us
e1000_receive_skb 436279 18818686 us 43.134 us
tcp_v4_do_rcv 513521 18681598 us 36.379 us
xprt_transmit 43346 18532249 us 427.542 us
tcp_rcv_established 513521 18384627 us 35.801 us
napi_gro_receive 436279 18114586 us 41.520 us
napi_skb_finish 436279 17130261 us 39.264 us
netif_receive_skb 436279 16886046 us 38.704 us
ip_rcv 436279 16500338 us 37.820 us
xs_tcp_send_request 43346 16418581 us 378.779 us
xs_sendpages 43346 16367619 us 377.603 us
xs_send_kvec 43347 16310325 us 376.273 us
kernel_sendmsg 43347 16254698 us 374.990 us
sock_sendmsg 43347 16177184 us 373.201 us
tcp_sendmsg 43347 16127746 us 372.061 us
do_IRQ 45271 15968995 us 352.742 us
ip_local_deliver 436279 15816133 us 36.252 us
irq_exit 53709 15421475 us 287.130 us
acpi_processor_ffh_cstate_ente 17999 15394735 us 855.310 us
mwait_idle_with_hints 17999 15386089 us 854.830 us
tcp_v4_rcv 436217 15173852 us 34.785 us
release_sock 43364 13536313 us 312.155 us
xs_tcp_data_ready 512045 10092920 us 19.711 us
_spin_unlock_bh 518900 9851221 us 18.984 us
local_bh_enable_ip 511352 9403704 us 18.389 us
tcp_read_sock 1024090 8832051 us 8.624 us
__tcp_ack_snd_check 512045 6644838 us 12.977 us
add_to_page_cache_lru 174072 6287717 us 36.121 us
tcp_transmit_skb 296572 5962431 us 20.104 us
tcp_send_ack 253210 5775135 us 22.807 us
__kfree_skb 802932 5216451 us 6.496 us
add_to_page_cache_locked 174072 4716425 us 27.094 us
[...]

Detailed time before patch (100 runs * 2 CPU):

Total add_to_page_cache_lru percent
1927694859.404 6703561 0.34775011030908648%
1994166534.446 7311783 0.36665859514242066%
2132812997.930 6249683 0.29302536162643533%
1972582396.980 7908593 0.40092586307715006%
2415371209.622 7670327 0.31756307144194151%
2163473353.153 7406991 0.34236571433640306%
1833544521.635 7952113 0.43370165851816772%
2087748556.683 7995535 0.38297404035585819%
2185868844.795 7117368 0.32560819085499598%
1980423680.193 6514699 0.32895481230385598%
1958021484.352 6745228 0.34449203207963308%
2008592063.702 7398967 0.36836583862444905%
2448117568.931 6224454 0.25425470079519025%
2213021186.789 6938228 0.31351837214297412%
1828902235.789 7479842 0.40897987074596959%
1866644864.837 7720792 0.41361868802366963%
2054526163.572 8475856 0.41254553727677401%
1902862587.026 8334661 0.43800645705196778%
2148449373.925 8672498 0.40366313050031155%
2104880443.231 7741619 0.36779376353160392%
2270646549.540 6486688 0.2856758134071597%
2250814274.826 7512331 0.33376058984612328%
2214008898.106 6391259 0.28867359139646992%
2446226035.589 7738434 0.31634173978271579%
1920377000.358 6464992 0.33665223020244384%
2030446733.714 6022617 0.2966153654759367%
2157128575.464 7375670 0.34192074055731642%
2185670134.683 7943199 0.36342167438509865%
2224381465.969 8492920 0.38181041021667828%
2479045291.455 8161934 0.32923698603382928%
1971641972.432 8114434 0.41155717485517562%
2208169216.013 8138519 0.36856410011433138%
1971312194.985 8122576 0.41203904793283169%
2279129743.345 7365286 0.3231622079219687%
2360162707.877 7412468 0.3140659741491984%
1885022652.013 8252339 0.43778460652382062%
2065345722.062 7985361 0.3866355600759942%
1975982196.916 7925371 0.4010851419799969%
2169738273.075 7203856 0.33201497569522709%
2034570957.887 7396992 0.36356520136717829%
1789538123.758 7380292 0.41241323121418116%
1912812047.546 8697404 0.45469203370807615%
2221047646.893 7451321 0.33548676951723994%
2163981380.559 7982481 0.36887937538251664%
1969189605.396 7353112 0.37340802428831149%
2278909389.270 7088803 0.31106120468750831%
2104996788.904 7342692 0.34882200479854836%
2374705789.293 7327329 0.30855733931492624%
2137944282.456 8112283 0.3794431438915179%
2268117031.230 6889022 0.30373309247909852%
1899091967.945 8173918 0.43041190937398177%
2098779842.987 7858621 0.37443760603377763%
2124662819.215 8156209 0.38388251190904155%
2317714239.959 7048899 0.30413149638864423%
2218944220.453 6659501 0.30012025262358577%
1945126557.801 7704643 0.3960998305791596%
2092060468.851 7616315 0.36405807161888715%
1933379809.531 7300893 0.37762331870896354%
1903832170.769 7508856 0.39440745435912067%
2435478683.656 6790800 0.27882814354203433%
2126189870.763 7104836 0.33415811530747125%
2161363441.772 7011298 0.32439236569356278%
2064449342.765 6822211 0.33046153561039859%
2116816508.947 8156178 0.38530396779913867%
1982774633.856 7131335 0.35966442571092105%
2323253766.182 7275361 0.3131539526978242%
1905927501.820 7477458 0.39232646534874271%
2252162684.001 8064426 0.35807475442552977%
2027146552.281 7875866 0.38851981328818402%
1845397260.011 7598767 0.41176862915439172%
2177206310.501 7366288 0.33833670077434846%
2041629016.560 6600368 0.32328929234759562%
2189308825.112 8348863 0.38134697600613249%
1930990433.231 7344038 0.38032492930126544%
2102320863.469 7299411 0.34720727586536809%
1895798071.683 8051595 0.42470741585110772%
2019648719.623 7992832 0.39575357448754711%
1961118039.267 7837733 0.39965636147681766%
2142238585.905 8084509 0.37738602288244455%
2263212432.641 7757018 0.34274369865263332%
2200111036.549 7069702 0.32133387281622078%
1941699880.568 7786856 0.4010329339734075%
1939777157.562 7729686 0.39848319534368681%
1983015138.927 8136252 0.41029701893261827%
2040218650.019 7968023 0.39054750332400873%
1850721920.862 7835912 0.42339758943095634%
2167986192.252 8656589 0.39929170356052646%
2128868496.040 8248906 0.38747841942065214%
1984840247.084 7754853 0.39070413910605312%
2130780080.168 7053725 0.33103955990821227%
2062729676.752 7968701 0.38631824081513272%
1995756518.448 7903687 0.39602461156666058%
1921150053.008 8307746 0.43243608103345821%
2253731991.230 8033382 0.35644797301810893%
2143807312.838 6755593 0.31512127790332323%
1948787937.040 7152297 0.36701258582622248%
2276500726.783 7277806 0.31969267193182554%
2282733838.581 6336839 0.27759867983291148%
2003104438.909 7092547 0.35407774363791977%
2239684291.880 6982252 0.3117516171950766%
1801833773.742 2645976 0.14684906224756283%
2016683111.150 2261656 0.11214731692329717%
1962727948.601 2767184 0.14098663046870061%
1845203884.375 2266317 0.1228220371304734%
1516914718.568 1975612 0.13023883121557553%
1557730294.184 2109274 0.13540688063108641%
2171303320.533 1865384 0.085910797554627022%
1617322895.641 2073254 0.12819048104666192%
1867107731.934 1901409 0.10183713384500154%
1772588329.145 2045529 0.11539786008783279%
2030802387.562 2161163 0.10641916777508319%
1853583237.019 1803299 0.097287187539531861%
1453728764.040 3231016 0.22225714176699726%
1989811798.738 2418414 0.1215398361560542%
1956758511.567 2254818 0.11523230826241863%
2001579398.410 1952360 0.097540971971978793%
1636515132.388 1532408 0.093638486419854422%
1926271619.408 1918912 0.09961793449408439%
1642519133.238 1384193 0.084272564744573439%
1677033195.955 1340416 0.07992781557533149%
1693983317.244 3135804 0.18511421972571418%
1487434932.071 2467493 0.1658891388656871%
1720442169.343 3034092 0.17635536108479918%
1640362983.552 1944463 0.11853858075908964%
1821382248.526 2659828 0.1460334864991977%
1947322513.836 3540608 0.18181929160904178%
1673422299.278 2769832 0.16551900863249205%
1836039325.613 2175872 0.11850900847526999%
1582929718.539 2072026 0.13089816785500888%
1476014468.174 1996943 0.13529291501257767%
1847131237.607 2120975 0.11482535495137695%
1734796451.802 1842211 0.10619176665287874%
1804827020.363 1988047 0.11015166426310204%
1647250758.752 2634152 0.15991202225917941%
1506833507.228 2271966 0.15077750720977479%
1919459367.994 1608374 0.083793073550751385%
1913364635.281 2126200 0.11112361756846952%
1733934046.047 2031440 0.11715785872197679%
1762796865.803 2285487 0.12965118354456009%
1711072016.023 2270996 0.13272357789349024%
2082560830.257 2273483 0.10916766353083858%
1857948273.729 1266725 0.06817870109255604%
1738287034.188 2404600 0.1383315846409251%
1618942973.292 2016207 0.12453848179100423%
1820789056.854 2369799 0.13015230902664757%
1561747605.669 2487149 0.15925422206327566%
1884502059.843 2512148 0.13330566485076115%
1352203145.315 2618029 0.19361210695824271%
1911003204.592 1648292 0.086252707271200574%
1525244016.511 2555354 0.16753738892517536%
2190575134.148 1939935 0.088558249829422828%
1643983673.168 1747429 0.10629235730989092%
1781900292.843 1686195 0.094629032094141294%
1566821289.746 2291057 0.14622324926229507%
1777199439.264 2456866 0.13824368530171682%
1633063412.403 1953080 0.11959609070698031%
1913601422.874 1821650 0.095194849785599556%
2094732749.709 2322465 0.11087166132876074%
1951529182.943 2270128 0.11632559839953495%
1450396654.276 2867633 0.19771370759479912%
1643954213.946 2460211 0.14965203891504558%
1845971802.604 2593942 0.14051904781757141%
1671974226.831 2552423 0.15265923116756236%
1792483105.554 1547320 0.086322710390164162%
1774273751.516 2101959 0.1184686973024323%
1797523443.598 1781759 0.099122990932099045%
1877038808.523 2131598 0.11356174365288201%
1485614991.186 1796722 0.12094129439052281%
2008042349.453 1972441 0.09822706182154485%
2178564679.489 1948988 0.089462021410222761%
1462911389.586 2076158 0.14191960051575969%
1976013519.965 2692701 0.13626936115536772%
1473561005.083 1369873 0.092963439944099252%
2190713955.687 2450228 0.11184609444967988%
1648795783.114 2557892 0.15513698095279177%
2116292380.702 1948901 0.092090347145394241%
1704227210.803 1980402 0.11620527987385389%
1984573918.515 2325530 0.11718031655581404%
1856414008.018 1798348 0.096872141248277158%
1401593348.875 2285205 0.16304336788086488%
1832475317.969 2373585 0.12952889333487622%
1840826298.949 2112188 0.11474129857911804%
2014475390.352 2104652 0.10447643143618861%
1730176303.748 1975645 0.11418749613667999%
2016553863.650 2009045 0.099627638825555156%
1891795883.646 2160557 0.11420666567029553%
1881160108.475 1492908 0.079361027978115894%
1562754135.542 1978972 0.12663361145505117%
2061996559.038 2071212 0.10044691834821971%
1519528906.706 2460493 0.16192472477103448%
1796244525.568 2099254 0.1168690548596764%
1998463642.677 1988093 0.099481069234609226%
1872169708.204 1762657 0.094150492462082555%
1664425938.102 1710870 0.10279039522485223%
1791543540.761 1992110 0.11119517637588673%
1883913354.514 1770358 0.093972368514618113%
1813806740.237 1877185 0.10349421238531283%
1547586337.236 2702003 0.17459465329900728%
2020313568.921 2100334 0.10396079263684484%
1727289498.314 2059429 0.1192289423405978%

Detailed time after patch:

Total add_to_page_cache_lru percent
2151489401.227 6287717 0.29224949918015392%
1919030541.464 6872841 0.3581413037208262%
2277104485.950 8538296 0.37496285535785828%
2142847301.255 7992466 0.37298345968558083%
1909650847.324 6960671 0.36449966808089612%
2109123648.836 6825371 0.32361170497361974%
2128503623.032 7728883 0.36311345286743746%
1925232499.136 7526064 0.39091714914315673%
1935780958.829 7499162 0.38739723964101919%
1722848267.832 7353451 0.42681942091473007%
1855159456.452 7600681 0.40970499724785564%
1862769335.900 8607712 0.46209221045831578%
2272365556.764 6895008 0.30342864419310028%
2049429274.204 7252089 0.35385895435775488%
2001218334.232 8155639 0.4075336938750293%
1947131845.644 7229822 0.37130623774523025%
1982915408.746 7513518 0.37891268416495716%
2038654062.353 7531559 0.36943781385386831%
2134879530.359 6738727 0.31564905205057731%
2029761134.284 7472398 0.36814174208907074%
1956557094.935 7280443 0.37210480690019776%
1740007857.373 6663035 0.38293131676196024%
2054278081.418 7765213 0.37800203732105886%
2367585922.353 7629987 0.32226864199365651%
2023763123.141 7730698 0.38199618876350999%
1826074987.177 7579606 0.41507638258150648%
2077873224.783 7414675 0.35683962387909118%
2103719398.720 7093390 0.33718327664402131%
2164758918.301 7181094 0.33172719323572736%
1953199921.042 7750654 0.39681826302067191%
2179298642.359 6395109 0.29344803303679207%
1971922836.328 7215479 0.36591081897688477%
2225552712.785 8153246 0.36634701812105075%
2124780733.448 7080842 0.33325048032177529%
2139511762.563 8708060 0.40701155059639843%
1934365642.348 6842114 0.35371358186939283%
1950765025.910 6818753 0.34954250816646476%
2013188877.577 6495929 0.32266863146086228%
1921391492.273 6536576 0.34020011154870117%
2164698840.945 6712421 0.31008567441509277%
1992440404.239 7467130 0.3747730664422067%
2073623249.374 7113197 0.34303227465005426%
1938481781.041 6902169 0.35606055561138211%
1963963682.421 7776006 0.39593430721765843%
1807704433.895 7206899 0.39867684477995868%
1902413226.299 7687850 0.40411041585093088%
1718395340.661 7096452 0.41296969516166576%
2014139552.830 7653595 0.37999328245384933%
1976902280.499 7781174 0.3936043818026207%
1920129734.441 6808420 0.3545812492707483%
1786139514.638 7179970 0.4019825966089316%
2011990483.317 6739406 0.33496212113733792%
2186643920.919 7830962 0.35812698743876015%
1822975832.465 8627042 0.47323951565143058%
1872249428.425 7264440 0.38800599373722838%
1971668701.573 6818269 0.34581210294408871%
2054585589.388 7261214 0.35341501651254642%
2252858576.565 7228902 0.32087686618225808%
2196771348.820 6634658 0.30201859668116204%
2224872312.502 6948714 0.31231967609798522%
1924947573.139 6935037 0.360271474235066%
2070207954.999 7564561 0.36540102078796494%
2070702822.858 6994172 0.33776802362912656%
2099826181.510 7199665 0.34286957003377627%
2369037401.585 7009501 0.29587971027009985%
2109093017.803 7056062 0.33455432929886414%
2141046841.968 7831694 0.36578807368835009%
2284785719.942 7358167 0.3220506385249462%
2088247084.213 6857813 0.32840045853982414%
1880913709.753 6697468 0.35607523967059107%
1984586172.943 7361786 0.37094816543455983%
2130320004.839 7840067 0.36802297223850727%
2314059227.236 6669054 0.28819720435443524%
1935408704.712 8309322 0.42933164348025787%
2239440115.241 7040432 0.31438357972087749%
2387500130.769 6555638 0.27458168129559291%
2497500671.323 6969731 0.27906823329532587%
2229438175.095 6759889 0.30321042653322067%
1979317661.013 7870099 0.39761677243723181%
2228213884.202 5883791 0.26405862748257614%
2264526373.743 6563122 0.28982316461838858%
2256376049.198 6670128 0.29561242694323103%
2085817937.654 6878115 0.32975624937505721%
2177225113.484 7421724 0.34087995559282069%
1843473456.551 7156295 0.38819625932607071%
2403022978.094 6958096 0.28955594946157931%
1935191866.634 6872608 0.35513832599730566%
1917810458.850 7816937 0.40759695328220108%
2016838266.381 7257719 0.35985627211562177%
1904575178.166 7484129 0.39295529448235278%
2087232238.930 7372521 0.35321996577532039%
1976631116.176 6438233 0.32571747693901715%
2004191806.805 6735567 0.33607397141980955%
2110550986.328 7174599 0.33993961986592813%
2152621184.889 7337361 0.34085704681840484%
2059135774.936 6363510 0.30903790208772336%
2353978555.486 7265456 0.30864580236160993%
1945677192.325 5927833 0.30466682877217138%
2255676557.335 6379740 0.28283044301074045%
2034627875.756 7319043 0.35972391252530572%
1712556465.847 3079747 0.17983331127577226%
1837380343.831 3111263 0.1693314620702272%
1680363376.083 1611968 0.095929727042585719%
1769978476.954 2017096 0.11396161175198416%
1829700109.543 2612874 0.14280340184559598%
1783959966.875 3184558 0.17851062014460206%
1772032973.967 2133741 0.12041203698502599%
1667688494.152 1985120 0.11903422053705601%
2001252356.720 2633874 0.13161128785961057%
2077009766.270 2858188 0.13761071548223289%
2076612330.379 2340135 0.11269002720276176%
1834812072.536 1713713 0.093399919569494547%
1787949752.795 2522212 0.14106727529994451%
1907464219.876 2476303 0.12982172741153589%
1609254879.065 1903707 0.11829741980375917%
2069557067.911 2484547 0.12005211349440528%
1725151836.363 2491288 0.1444097816486799%
2077004221.561 2508662 0.12078271069254648%
1650748196.246 3195100 0.19355465644403197%
1843881436.725 2731880 0.14815920078094685%
1677080490.297 2752567 0.16412849686853959%
2103777441.563 3252694 0.15461207710181613%
1598451912.768 2345538 0.1467381021139566%
1825835416.807 2381575 0.13043755083713246%
1969692431.120 2198750 0.11162910336969484%
1867823718.088 1813348 0.097083465770326324%
1929972836.031 2140104 0.11088777831718792%
1627140530.277 2519227 0.15482541016731566%
1779407314.725 2828755 0.15897175293095681%
1724839045.836 1937500 0.11232932166496305%
1780135194.997 2889504 0.16231935687361485%
1758590417.053 2193512 0.12473126082853507%
1729351400.188 1523643 0.088104881392778986%
1484255938.207 2153531 0.14509162096406991%
1504352024.282 1910774 0.12701641432044322%
1916441202.247 2423510 0.12645887581411155%
1651813655.113 2892863 0.1751325272705837%
1772411947.023 2693122 0.15194673024651262%
1738795901.180 3064160 0.17622309771495132%
1717902110.376 3266140 0.19012375503078779%
1858583685.387 2246750 0.12088505982619635%
1641867344.242 2901893 0.17674345069209688%
2037933222.333 3290649 0.16146991294606342%
1713754879.753 2136278 0.1246548164640615%
2265307843.106 2195967 0.096939010151886218%
2066343912.900 2117667 0.10248376307446183%
2033259596.459 2652081 0.13043494321230309%
2029016228.457 2447926 0.12064595470789147%
1649943910.454 2243384 0.13596728869302646%
2064977308.315 2517573 0.12191770775700743%
1962826877.902 2403546 0.12245328546596376%
1991037543.226 3214700 0.16145853255942866%
1469646744.644 1960481 0.133398111290676%
2195532272.023 1693642 0.077140382839349028%
1902754342.931 2542547 0.13362455376575119%
1922726134.774 2726990 0.14182935108023245%
1629134454.130 2814497 0.17276026499009955%
1442857977.322 2886065 0.2000241912483062%
1782430811.270 2884583 0.16183421997427802%
1574026949.337 2873023 0.18252692568003068%
2064769936.712 2717133 0.13159495165484839%
1621381781.294 2143854 0.13222388611576869%
1783777994.311 2515565 0.14102455619605617%
1561198792.593 2402227 0.15387066729728466%
1658682002.165 2924639 0.17632306832669586%
1675144757.604 2216513 0.13231769910860314%
1712859849.041 2274046 0.13276310967725691%
1448178253.374 2507727 0.17316424923226392%
1621787692.726 2822681 0.17404750403892047%
1924315582.555 2275728 0.11826168330344308%
2044659318.750 2606738 0.12749008972280165%
1559320878.499 2049522 0.13143683434629996%
1641314354.590 2786145 0.16975084585158451%
1769603624.855 2008994 0.11352790940200608%
1707747444.550 2813289 0.16473682973311757%
1366611645.499 2930513 0.21443641356721807%
1435165128.768 3161655 0.22029903992400404%
1440555681.860 2831258 0.19653929630435177%
1627011278.312 1956565 0.12025515901954331%
1668464384.654 3863649 0.23156916237089648%
1589045506.115 2743558 0.17265446391825656%
1766150748.153 3207045 0.18158387687765917%
1668645748.936 2997211 0.17961937109246526%
1808149919.182 2420007 0.13383884678626659%
1831096162.185 2226869 0.12161398434382248%
1473136553.216 2774496 0.18833936296964501%
1873520917.567 2427200 0.12955286366121929%
1971938578.205 2207736 0.11195764535473715%
1670588149.850 2437810 0.14592525394238479%
1912140400.200 2407058 0.12588291109524352%
1925513187.452 2415130 0.12542786077699639%
1658671061.956 3111210 0.18757245311382492%
1917423750.996 2888924 0.15066695603928745%
1553962205.772 3386946 0.21795549386076499%
1834101490.364 2367281 0.12907033838842713%
1626846627.678 3362957 0.20671629044711806%
1519340210.104 2320735 0.15274623712098975%
1725615314.262 3813557 0.22099693764197714%
1725823077.103 3137784 0.1818137699993643%
1633124107.941 2007344 0.12291435722731486%

Thanks,
Fengguang

2009-09-26 11:09:25

by Hugh Dickins

Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Sat, 26 Sep 2009, Wu Fengguang wrote:

> The swap cache and page cache code assume that they 'own' a newly
> allocated page and can therefore disregard the locking rules. However,
> hwpoison can now hit any page at any time.
>
> So use the safer lock_page()/trylock_page(). The main intention is not
> to close the small time window of memory corruption, but to avoid
> kernel oopses that may result from such races, and also to avoid
> raising false alerts in hwpoison stress tests.
>
> In theory this will slightly increase page cache/swap cache overheads,
> however the difference seems too small to be measurable in benchmarks.

No.

But I'd most certainly defer to Nick if he disagrees with me.

I don't think anyone would want to quarrel very long over the swap
and migration mods alone, but add_to_page_cache() is of a higher
order of magnitude.

I can't see any reason to surrender add_to_page_cache() optimizations
to the remote possibility of hwpoison (infinitely remote for most of
us); though I wouldn't myself want to run the benchmark to defend them.

You'd later be sending a patch to replace __SetPageUptodate()s by
SetPageUptodate()s etc, wouldn't you? Because any non-atomic op
on page->flags might wipe your locked bit (or your hwpoison bit).

You could make #ifdef CONFIG_HWPOISON_INJECT select the slower
versions of these things; or #ifdef CONFIG_MEMORY_FAILURE? that
would pose distros with a harder choice. But I'd much prefer
not to go that way.

Please accept that there will be quite a number of places where
the code "knows" it's the only user of the page, and hwpoison
handling and testing should work around those places (can shift
things around slightly to suit itself better, but not add cost).

Look into why you think you want the page lock: I can see it's
going to be useful if you're taking a page out of a file (but then
why bother if page->mapping not set?), or if you're using rmap-style
lookup (but then why bother if !page_mapped?).

I suspect if memory_failure() did something like:

	if (page->mapping)
		lock_page_nosync(p);

then you'd be okay, perhaps with a few additional _inexpensive_
tweaks here and there. With the "necessary" memory barriers?
No, we probably wouldn't want to be adding any in hot paths.

But I definitely say "something like": remember that page_mapping()
does that weird thing with PageSwapCache (a mistake from day one in
my opinion), which might or might not be what you want. There are
probably various reasons why it's not as simple as I suggest above.

It seems to me that the Intel hardware guys have done half a job
here: the sooner they get to remapping the bad pages, the better.

Hugh

>
> CC: Hugh Dickins <[email protected]>
> CC: Andi Kleen <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
> ---
> include/linux/pagemap.h | 13 ++++---------
> mm/migrate.c | 2 +-
> mm/swap_state.c | 4 ++--
> 3 files changed, 7 insertions(+), 12 deletions(-)
>
> --- sound-2.6.orig/mm/swap_state.c 2009-09-14 10:50:19.000000000 +0800
> +++ sound-2.6/mm/swap_state.c 2009-09-25 18:42:23.000000000 +0800
> @@ -306,7 +306,7 @@ struct page *read_swap_cache_async(swp_e
> * re-using the just freed swap entry for an existing page.
> * May fail (-ENOMEM) if radix-tree node allocation failed.
> */
> - __set_page_locked(new_page);
> + lock_page(new_page);
> SetPageSwapBacked(new_page);
> err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> if (likely(!err)) {
> @@ -318,7 +318,7 @@ struct page *read_swap_cache_async(swp_e
> return new_page;
> }
> ClearPageSwapBacked(new_page);
> - __clear_page_locked(new_page);
> + unlock_page(new_page);
> swapcache_free(entry, NULL);
> } while (err != -ENOMEM);
>
> --- sound-2.6.orig/include/linux/pagemap.h 2009-09-14 10:50:19.000000000 +0800
> +++ sound-2.6/include/linux/pagemap.h 2009-09-25 18:42:19.000000000 +0800
> @@ -292,11 +292,6 @@ extern int __lock_page_killable(struct p
> extern void __lock_page_nosync(struct page *page);
> extern void unlock_page(struct page *page);
>
> -static inline void __set_page_locked(struct page *page)
> -{
> - __set_bit(PG_locked, &page->flags);
> -}
> -
> static inline void __clear_page_locked(struct page *page)
> {
> __clear_bit(PG_locked, &page->flags);
> @@ -435,18 +430,18 @@ extern void remove_from_page_cache(struc
> extern void __remove_from_page_cache(struct page *page);
>
> /*
> - * Like add_to_page_cache_locked, but used to add newly allocated pages:
> - * the page is new, so we can just run __set_page_locked() against it.
> + * Like add_to_page_cache_locked, but used to add newly allocated pages.
> */
> static inline int add_to_page_cache(struct page *page,
> struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask)
> {
> int error;
>
> - __set_page_locked(page);
> + if (!trylock_page(page))
> + return -EIO; /* hwpoisoned */
> error = add_to_page_cache_locked(page, mapping, offset, gfp_mask);
> if (unlikely(error))
> - __clear_page_locked(page);
> + unlock_page(page);
> return error;
> }
>
> --- sound-2.6.orig/mm/migrate.c 2009-09-14 10:50:19.000000000 +0800
> +++ sound-2.6/mm/migrate.c 2009-09-25 18:42:19.000000000 +0800
> @@ -551,7 +551,7 @@ static int move_to_new_page(struct page
> * holding a reference to the new page at this point.
> */
> if (!trylock_page(newpage))
> - BUG();
> + return -EAGAIN; /* got by hwpoison */
>
> /* Prepare mapping for the new page.*/
> newpage->index = page->index;

2009-09-26 11:32:13

by Fengguang Wu

Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Sat, Sep 26, 2009 at 06:52:59PM +0800, Wu Fengguang wrote:
> On Sat, Sep 26, 2009 at 11:49:36AM +0800, Andi Kleen wrote:
> > On Sat, Sep 26, 2009 at 11:15:37AM +0800, Wu Fengguang wrote:
> > > The swap cache and page cache code assume that they 'own' the newly
> > > allocated page and therefore can disregard the locking rules. However
> > > now hwpoison can hit any time on any page.
> > >
> > > So use the safer lock_page()/trylock_page(). The main intention is not
> > > to close such a small time window of memory corruption. But to avoid
> > > kernel oops that may result from such races, and also avoid raising
> > > false alerts in hwpoison stress tests.
> > >
> > > This in theory will slightly increase page cache/swap cache overheads,
> > > however it seems to be too small to be measurable in benchmark.
> >
> > Thanks. Can you please describe what benchmarks you used?
>
> Good question. The test case is to copy 1GB sparse file over NFS.
>
> I ran it 100 times (2 hours total wall time) and the overheads are ~0.008%.

And the standard deviation is 0.04%, much larger than the 0.008% difference.

Thanks,
Fengguang

> The average time reported by the function profiler:
>
> total time add_to_page_cache_lru percent
> before 1940083424.361 4841664.805 0.242%
> after 1914258447.188 4889044.435 0.250%
>
> The script used is:
>
> for i in `seq 1 100`
> do
> # copy 1GB sparse file over NFS
> echo 1 > /debug/tracing/function_profile_enabled
> cp /tmp/sparse/1G /dev/null
> echo 0 > /debug/tracing/function_profile_enabled
>
> fadvise /tmp/sparse/1G 0 0 dontneed
> cat /debug/tracing/trace_stat/function0 > function0.$i
> cat /debug/tracing/trace_stat/function1 > function1.$i
> done
>
>
> File function0.1 after patch:
>
> Function Hit Time Avg
> -------- --- ---- ---
> schedule 32492 474358128 us 14599.22 us
> schedule_timeout 26 99994288 us 3845934 us
> poll_schedule_timeout 8 52914376 us 6614297 us
> sys_poll 6 52914373 us 8819062 us
> schedule_hrtimeout_range 8 52914373 us 6614296 us
> do_sys_poll 6 52914355 us 8819059 us
> sys_read 86997 40703194 us 467.868 us
> vfs_read 87002 40550651 us 466.088 us
> do_sync_read 86987 40011054 us 459.965 us
> nfs_file_read 86985 39945631 us 459.224 us
> generic_file_aio_read 86985 39785858 us 457.387 us
> page_cache_async_readahead 2886 34379069 us 11912.35 us
> ondemand_readahead 2889 34371850 us 11897.49 us
> ra_submit 2891 34368378 us 11888.05 us
> __do_page_cache_readahead 2891 34360744 us 11885.41 us
> nfs_readpages 2876 33198107 us 11543.15 us
> read_cache_pages 2876 31617284 us 10993.49 us
> __rpc_execute 84468 25404441 us 300.758 us
> readpage_async_filler 174071 24947004 us 143.315 us
> do_softirq 44822 24829675 us 553.961 us
> __do_softirq 44822 24803408 us 553.375 us
> net_rx_action 38424 24357207 us 633.906 us
> e1000_clean 39042 24253141 us 621.206 us
> nfs_pageio_doio 43534 22993433 us 528.171 us
> nfs_pagein_one 43520 22939752 us 527.108 us
> nfs_read_rpcsetup 43513 22533176 us 517.849 us
> e1000_clean_rx_irq 39042 22047306 us 564.707 us
> rpc_run_task 43528 22042766 us 506.404 us
> call_transmit 45873 21842655 us 476.154 us
> nfs_pageio_add_request 174091 21583666 us 123.979 us
> rpc_execute 43528 21254402 us 488.292 us
> cpuidle_idle_call 18073 19445493 us 1075.941 us
> acpi_idle_enter_c1 17999 19396945 us 1077.667 us
> e1000_receive_skb 436279 18818686 us 43.134 us
> tcp_v4_do_rcv 513521 18681598 us 36.379 us
> xprt_transmit 43346 18532249 us 427.542 us
> tcp_rcv_established 513521 18384627 us 35.801 us
> napi_gro_receive 436279 18114586 us 41.520 us
> napi_skb_finish 436279 17130261 us 39.264 us
> netif_receive_skb 436279 16886046 us 38.704 us
> ip_rcv 436279 16500338 us 37.820 us
> xs_tcp_send_request 43346 16418581 us 378.779 us
> xs_sendpages 43346 16367619 us 377.603 us
> xs_send_kvec 43347 16310325 us 376.273 us
> kernel_sendmsg 43347 16254698 us 374.990 us
> sock_sendmsg 43347 16177184 us 373.201 us
> tcp_sendmsg 43347 16127746 us 372.061 us
> do_IRQ 45271 15968995 us 352.742 us
> ip_local_deliver 436279 15816133 us 36.252 us
> irq_exit 53709 15421475 us 287.130 us
> acpi_processor_ffh_cstate_ente 17999 15394735 us 855.310 us
> mwait_idle_with_hints 17999 15386089 us 854.830 us
> tcp_v4_rcv 436217 15173852 us 34.785 us
> release_sock 43364 13536313 us 312.155 us
> xs_tcp_data_ready 512045 10092920 us 19.711 us
> _spin_unlock_bh 518900 9851221 us 18.984 us
> local_bh_enable_ip 511352 9403704 us 18.389 us
> tcp_read_sock 1024090 8832051 us 8.624 us
> __tcp_ack_snd_check 512045 6644838 us 12.977 us
> add_to_page_cache_lru 174072 6287717 us 36.121 us
> tcp_transmit_skb 296572 5962431 us 20.104 us
> tcp_send_ack 253210 5775135 us 22.807 us
> __kfree_skb 802932 5216451 us 6.496 us
> add_to_page_cache_locked 174072 4716425 us 27.094 us
> [...]
>
> Detailed time before patch (100 runs * 2 CPU):
>
> Total add_to_page_cache_lru percent
> 1927694859.404 6703561 0.34775011030908648%
> 1994166534.446 7311783 0.36665859514242066%
> 2132812997.930 6249683 0.29302536162643533%
> 1972582396.980 7908593 0.40092586307715006%
> 2415371209.622 7670327 0.31756307144194151%
> 2163473353.153 7406991 0.34236571433640306%
> 1833544521.635 7952113 0.43370165851816772%
> 2087748556.683 7995535 0.38297404035585819%
> 2185868844.795 7117368 0.32560819085499598%
> 1980423680.193 6514699 0.32895481230385598%
> 1958021484.352 6745228 0.34449203207963308%
> 2008592063.702 7398967 0.36836583862444905%
> 2448117568.931 6224454 0.25425470079519025%
> 2213021186.789 6938228 0.31351837214297412%
> 1828902235.789 7479842 0.40897987074596959%
> 1866644864.837 7720792 0.41361868802366963%
> 2054526163.572 8475856 0.41254553727677401%
> 1902862587.026 8334661 0.43800645705196778%
> 2148449373.925 8672498 0.40366313050031155%
> 2104880443.231 7741619 0.36779376353160392%
> 2270646549.540 6486688 0.2856758134071597%
> 2250814274.826 7512331 0.33376058984612328%
> 2214008898.106 6391259 0.28867359139646992%
> 2446226035.589 7738434 0.31634173978271579%
> 1920377000.358 6464992 0.33665223020244384%
> 2030446733.714 6022617 0.2966153654759367%
> 2157128575.464 7375670 0.34192074055731642%
> 2185670134.683 7943199 0.36342167438509865%
> 2224381465.969 8492920 0.38181041021667828%
> 2479045291.455 8161934 0.32923698603382928%
> 1971641972.432 8114434 0.41155717485517562%
> 2208169216.013 8138519 0.36856410011433138%
> 1971312194.985 8122576 0.41203904793283169%
> 2279129743.345 7365286 0.3231622079219687%
> 2360162707.877 7412468 0.3140659741491984%
> 1885022652.013 8252339 0.43778460652382062%
> 2065345722.062 7985361 0.3866355600759942%
> 1975982196.916 7925371 0.4010851419799969%
> 2169738273.075 7203856 0.33201497569522709%
> 2034570957.887 7396992 0.36356520136717829%
> 1789538123.758 7380292 0.41241323121418116%
> 1912812047.546 8697404 0.45469203370807615%
> 2221047646.893 7451321 0.33548676951723994%
> 2163981380.559 7982481 0.36887937538251664%
> 1969189605.396 7353112 0.37340802428831149%
> 2278909389.270 7088803 0.31106120468750831%
> 2104996788.904 7342692 0.34882200479854836%
> 2374705789.293 7327329 0.30855733931492624%
> 2137944282.456 8112283 0.3794431438915179%
> 2268117031.230 6889022 0.30373309247909852%
> 1899091967.945 8173918 0.43041190937398177%
> 2098779842.987 7858621 0.37443760603377763%
> 2124662819.215 8156209 0.38388251190904155%
> 2317714239.959 7048899 0.30413149638864423%
> 2218944220.453 6659501 0.30012025262358577%
> 1945126557.801 7704643 0.3960998305791596%
> 2092060468.851 7616315 0.36405807161888715%
> 1933379809.531 7300893 0.37762331870896354%
> 1903832170.769 7508856 0.39440745435912067%
> 2435478683.656 6790800 0.27882814354203433%
> 2126189870.763 7104836 0.33415811530747125%
> 2161363441.772 7011298 0.32439236569356278%
> 2064449342.765 6822211 0.33046153561039859%
> 2116816508.947 8156178 0.38530396779913867%
> 1982774633.856 7131335 0.35966442571092105%
> 2323253766.182 7275361 0.3131539526978242%
> 1905927501.820 7477458 0.39232646534874271%
> 2252162684.001 8064426 0.35807475442552977%
> 2027146552.281 7875866 0.38851981328818402%
> 1845397260.011 7598767 0.41176862915439172%
> 2177206310.501 7366288 0.33833670077434846%
> 2041629016.560 6600368 0.32328929234759562%
> 2189308825.112 8348863 0.38134697600613249%
> 1930990433.231 7344038 0.38032492930126544%
> 2102320863.469 7299411 0.34720727586536809%
> 1895798071.683 8051595 0.42470741585110772%
> 2019648719.623 7992832 0.39575357448754711%
> 1961118039.267 7837733 0.39965636147681766%
> 2142238585.905 8084509 0.37738602288244455%
> 2263212432.641 7757018 0.34274369865263332%
> 2200111036.549 7069702 0.32133387281622078%
> 1941699880.568 7786856 0.4010329339734075%
> 1939777157.562 7729686 0.39848319534368681%
> 1983015138.927 8136252 0.41029701893261827%
> 2040218650.019 7968023 0.39054750332400873%
> 1850721920.862 7835912 0.42339758943095634%
> 2167986192.252 8656589 0.39929170356052646%
> 2128868496.040 8248906 0.38747841942065214%
> 1984840247.084 7754853 0.39070413910605312%
> 2130780080.168 7053725 0.33103955990821227%
> 2062729676.752 7968701 0.38631824081513272%
> 1995756518.448 7903687 0.39602461156666058%
> 1921150053.008 8307746 0.43243608103345821%
> 2253731991.230 8033382 0.35644797301810893%
> 2143807312.838 6755593 0.31512127790332323%
> 1948787937.040 7152297 0.36701258582622248%
> 2276500726.783 7277806 0.31969267193182554%
> 2282733838.581 6336839 0.27759867983291148%
> 2003104438.909 7092547 0.35407774363791977%
> 2239684291.880 6982252 0.3117516171950766%
> 1801833773.742 2645976 0.14684906224756283%
> 2016683111.150 2261656 0.11214731692329717%
> 1962727948.601 2767184 0.14098663046870061%
> 1845203884.375 2266317 0.1228220371304734%
> 1516914718.568 1975612 0.13023883121557553%
> 1557730294.184 2109274 0.13540688063108641%
> 2171303320.533 1865384 0.085910797554627022%
> 1617322895.641 2073254 0.12819048104666192%
> 1867107731.934 1901409 0.10183713384500154%
> 1772588329.145 2045529 0.11539786008783279%
> 2030802387.562 2161163 0.10641916777508319%
> 1853583237.019 1803299 0.097287187539531861%
> 1453728764.040 3231016 0.22225714176699726%
> 1989811798.738 2418414 0.1215398361560542%
> 1956758511.567 2254818 0.11523230826241863%
> 2001579398.410 1952360 0.097540971971978793%
> 1636515132.388 1532408 0.093638486419854422%
> 1926271619.408 1918912 0.09961793449408439%
> 1642519133.238 1384193 0.084272564744573439%
> 1677033195.955 1340416 0.07992781557533149%
> 1693983317.244 3135804 0.18511421972571418%
> 1487434932.071 2467493 0.1658891388656871%
> 1720442169.343 3034092 0.17635536108479918%
> 1640362983.552 1944463 0.11853858075908964%
> 1821382248.526 2659828 0.1460334864991977%
> 1947322513.836 3540608 0.18181929160904178%
> 1673422299.278 2769832 0.16551900863249205%
> 1836039325.613 2175872 0.11850900847526999%
> 1582929718.539 2072026 0.13089816785500888%
> 1476014468.174 1996943 0.13529291501257767%
> 1847131237.607 2120975 0.11482535495137695%
> 1734796451.802 1842211 0.10619176665287874%
> 1804827020.363 1988047 0.11015166426310204%
> 1647250758.752 2634152 0.15991202225917941%
> 1506833507.228 2271966 0.15077750720977479%
> 1919459367.994 1608374 0.083793073550751385%
> 1913364635.281 2126200 0.11112361756846952%
> 1733934046.047 2031440 0.11715785872197679%
> 1762796865.803 2285487 0.12965118354456009%
> 1711072016.023 2270996 0.13272357789349024%
> 2082560830.257 2273483 0.10916766353083858%
> 1857948273.729 1266725 0.06817870109255604%
> 1738287034.188 2404600 0.1383315846409251%
> 1618942973.292 2016207 0.12453848179100423%
> 1820789056.854 2369799 0.13015230902664757%
> 1561747605.669 2487149 0.15925422206327566%
> 1884502059.843 2512148 0.13330566485076115%
> 1352203145.315 2618029 0.19361210695824271%
> 1911003204.592 1648292 0.086252707271200574%
> 1525244016.511 2555354 0.16753738892517536%
> 2190575134.148 1939935 0.088558249829422828%
> 1643983673.168 1747429 0.10629235730989092%
> 1781900292.843 1686195 0.094629032094141294%
> 1566821289.746 2291057 0.14622324926229507%
> 1777199439.264 2456866 0.13824368530171682%
> 1633063412.403 1953080 0.11959609070698031%
> 1913601422.874 1821650 0.095194849785599556%
> 2094732749.709 2322465 0.11087166132876074%
> 1951529182.943 2270128 0.11632559839953495%
> 1450396654.276 2867633 0.19771370759479912%
> 1643954213.946 2460211 0.14965203891504558%
> 1845971802.604 2593942 0.14051904781757141%
> 1671974226.831 2552423 0.15265923116756236%
> 1792483105.554 1547320 0.086322710390164162%
> 1774273751.516 2101959 0.1184686973024323%
> 1797523443.598 1781759 0.099122990932099045%
> 1877038808.523 2131598 0.11356174365288201%
> 1485614991.186 1796722 0.12094129439052281%
> 2008042349.453 1972441 0.09822706182154485%
> 2178564679.489 1948988 0.089462021410222761%
> 1462911389.586 2076158 0.14191960051575969%
> 1976013519.965 2692701 0.13626936115536772%
> 1473561005.083 1369873 0.092963439944099252%
> 2190713955.687 2450228 0.11184609444967988%
> 1648795783.114 2557892 0.15513698095279177%
> 2116292380.702 1948901 0.092090347145394241%
> 1704227210.803 1980402 0.11620527987385389%
> 1984573918.515 2325530 0.11718031655581404%
> 1856414008.018 1798348 0.096872141248277158%
> 1401593348.875 2285205 0.16304336788086488%
> 1832475317.969 2373585 0.12952889333487622%
> 1840826298.949 2112188 0.11474129857911804%
> 2014475390.352 2104652 0.10447643143618861%
> 1730176303.748 1975645 0.11418749613667999%
> 2016553863.650 2009045 0.099627638825555156%
> 1891795883.646 2160557 0.11420666567029553%
> 1881160108.475 1492908 0.079361027978115894%
> 1562754135.542 1978972 0.12663361145505117%
> 2061996559.038 2071212 0.10044691834821971%
> 1519528906.706 2460493 0.16192472477103448%
> 1796244525.568 2099254 0.1168690548596764%
> 1998463642.677 1988093 0.099481069234609226%
> 1872169708.204 1762657 0.094150492462082555%
> 1664425938.102 1710870 0.10279039522485223%
> 1791543540.761 1992110 0.11119517637588673%
> 1883913354.514 1770358 0.093972368514618113%
> 1813806740.237 1877185 0.10349421238531283%
> 1547586337.236 2702003 0.17459465329900728%
> 2020313568.921 2100334 0.10396079263684484%
> 1727289498.314 2059429 0.1192289423405978%
>
> Detailed time after patch:
>
> Total add_to_page_cache_lru percent
> 2151489401.227 6287717 0.29224949918015392%
> 1919030541.464 6872841 0.3581413037208262%
> 2277104485.950 8538296 0.37496285535785828%
> 2142847301.255 7992466 0.37298345968558083%
> 1909650847.324 6960671 0.36449966808089612%
> 2109123648.836 6825371 0.32361170497361974%
> 2128503623.032 7728883 0.36311345286743746%
> 1925232499.136 7526064 0.39091714914315673%
> 1935780958.829 7499162 0.38739723964101919%
> 1722848267.832 7353451 0.42681942091473007%
> 1855159456.452 7600681 0.40970499724785564%
> 1862769335.900 8607712 0.46209221045831578%
> 2272365556.764 6895008 0.30342864419310028%
> 2049429274.204 7252089 0.35385895435775488%
> 2001218334.232 8155639 0.4075336938750293%
> 1947131845.644 7229822 0.37130623774523025%
> 1982915408.746 7513518 0.37891268416495716%
> 2038654062.353 7531559 0.36943781385386831%
> 2134879530.359 6738727 0.31564905205057731%
> 2029761134.284 7472398 0.36814174208907074%
> 1956557094.935 7280443 0.37210480690019776%
> 1740007857.373 6663035 0.38293131676196024%
> 2054278081.418 7765213 0.37800203732105886%
> 2367585922.353 7629987 0.32226864199365651%
> 2023763123.141 7730698 0.38199618876350999%
> 1826074987.177 7579606 0.41507638258150648%
> 2077873224.783 7414675 0.35683962387909118%
> 2103719398.720 7093390 0.33718327664402131%
> 2164758918.301 7181094 0.33172719323572736%
> 1953199921.042 7750654 0.39681826302067191%
> 2179298642.359 6395109 0.29344803303679207%
> 1971922836.328 7215479 0.36591081897688477%
> 2225552712.785 8153246 0.36634701812105075%
> 2124780733.448 7080842 0.33325048032177529%
> 2139511762.563 8708060 0.40701155059639843%
> 1934365642.348 6842114 0.35371358186939283%
> 1950765025.910 6818753 0.34954250816646476%
> 2013188877.577 6495929 0.32266863146086228%
> 1921391492.273 6536576 0.34020011154870117%
> 2164698840.945 6712421 0.31008567441509277%
> 1992440404.239 7467130 0.3747730664422067%
> 2073623249.374 7113197 0.34303227465005426%
> 1938481781.041 6902169 0.35606055561138211%
> 1963963682.421 7776006 0.39593430721765843%
> 1807704433.895 7206899 0.39867684477995868%
> 1902413226.299 7687850 0.40411041585093088%
> 1718395340.661 7096452 0.41296969516166576%
> 2014139552.830 7653595 0.37999328245384933%
> 1976902280.499 7781174 0.3936043818026207%
> 1920129734.441 6808420 0.3545812492707483%
> 1786139514.638 7179970 0.4019825966089316%
> 2011990483.317 6739406 0.33496212113733792%
> 2186643920.919 7830962 0.35812698743876015%
> 1822975832.465 8627042 0.47323951565143058%
> 1872249428.425 7264440 0.38800599373722838%
> 1971668701.573 6818269 0.34581210294408871%
> 2054585589.388 7261214 0.35341501651254642%
> 2252858576.565 7228902 0.32087686618225808%
> 2196771348.820 6634658 0.30201859668116204%
> 2224872312.502 6948714 0.31231967609798522%
> 1924947573.139 6935037 0.360271474235066%
> 2070207954.999 7564561 0.36540102078796494%
> 2070702822.858 6994172 0.33776802362912656%
> 2099826181.510 7199665 0.34286957003377627%
> 2369037401.585 7009501 0.29587971027009985%
> 2109093017.803 7056062 0.33455432929886414%
> 2141046841.968 7831694 0.36578807368835009%
> 2284785719.942 7358167 0.3220506385249462%
> 2088247084.213 6857813 0.32840045853982414%
> 1880913709.753 6697468 0.35607523967059107%
> 1984586172.943 7361786 0.37094816543455983%
> 2130320004.839 7840067 0.36802297223850727%
> 2314059227.236 6669054 0.28819720435443524%
> 1935408704.712 8309322 0.42933164348025787%
> 2239440115.241 7040432 0.31438357972087749%
> 2387500130.769 6555638 0.27458168129559291%
> 2497500671.323 6969731 0.27906823329532587%
> 2229438175.095 6759889 0.30321042653322067%
> 1979317661.013 7870099 0.39761677243723181%
> 2228213884.202 5883791 0.26405862748257614%
> 2264526373.743 6563122 0.28982316461838858%
> 2256376049.198 6670128 0.29561242694323103%
> 2085817937.654 6878115 0.32975624937505721%
> 2177225113.484 7421724 0.34087995559282069%
> 1843473456.551 7156295 0.38819625932607071%
> 2403022978.094 6958096 0.28955594946157931%
> 1935191866.634 6872608 0.35513832599730566%
> 1917810458.850 7816937 0.40759695328220108%
> 2016838266.381 7257719 0.35985627211562177%
> 1904575178.166 7484129 0.39295529448235278%
> 2087232238.930 7372521 0.35321996577532039%
> 1976631116.176 6438233 0.32571747693901715%
> 2004191806.805 6735567 0.33607397141980955%
> 2110550986.328 7174599 0.33993961986592813%
> 2152621184.889 7337361 0.34085704681840484%
> 2059135774.936 6363510 0.30903790208772336%
> 2353978555.486 7265456 0.30864580236160993%
> 1945677192.325 5927833 0.30466682877217138%
> 2255676557.335 6379740 0.28283044301074045%
> 2034627875.756 7319043 0.35972391252530572%
> 1712556465.847 3079747 0.17983331127577226%
> 1837380343.831 3111263 0.1693314620702272%
> 1680363376.083 1611968 0.095929727042585719%
> 1769978476.954 2017096 0.11396161175198416%
> 1829700109.543 2612874 0.14280340184559598%
> 1783959966.875 3184558 0.17851062014460206%
> 1772032973.967 2133741 0.12041203698502599%
> 1667688494.152 1985120 0.11903422053705601%
> 2001252356.720 2633874 0.13161128785961057%
> 2077009766.270 2858188 0.13761071548223289%
> 2076612330.379 2340135 0.11269002720276176%
> 1834812072.536 1713713 0.093399919569494547%
> 1787949752.795 2522212 0.14106727529994451%
> 1907464219.876 2476303 0.12982172741153589%
> 1609254879.065 1903707 0.11829741980375917%
> 2069557067.911 2484547 0.12005211349440528%
> 1725151836.363 2491288 0.1444097816486799%
> 2077004221.561 2508662 0.12078271069254648%
> 1650748196.246 3195100 0.19355465644403197%
> 1843881436.725 2731880 0.14815920078094685%
> 1677080490.297 2752567 0.16412849686853959%
> 2103777441.563 3252694 0.15461207710181613%
> 1598451912.768 2345538 0.1467381021139566%
> 1825835416.807 2381575 0.13043755083713246%
> 1969692431.120 2198750 0.11162910336969484%
> 1867823718.088 1813348 0.097083465770326324%
> 1929972836.031 2140104 0.11088777831718792%
> 1627140530.277 2519227 0.15482541016731566%
> 1779407314.725 2828755 0.15897175293095681%
> 1724839045.836 1937500 0.11232932166496305%
> 1780135194.997 2889504 0.16231935687361485%
> 1758590417.053 2193512 0.12473126082853507%
> 1729351400.188 1523643 0.088104881392778986%
> 1484255938.207 2153531 0.14509162096406991%
> 1504352024.282 1910774 0.12701641432044322%
> 1916441202.247 2423510 0.12645887581411155%
> 1651813655.113 2892863 0.1751325272705837%
> 1772411947.023 2693122 0.15194673024651262%
> 1738795901.180 3064160 0.17622309771495132%
> 1717902110.376 3266140 0.19012375503078779%
> 1858583685.387 2246750 0.12088505982619635%
> 1641867344.242 2901893 0.17674345069209688%
> 2037933222.333 3290649 0.16146991294606342%
> 1713754879.753 2136278 0.1246548164640615%
> 2265307843.106 2195967 0.096939010151886218%
> 2066343912.900 2117667 0.10248376307446183%
> 2033259596.459 2652081 0.13043494321230309%
> 2029016228.457 2447926 0.12064595470789147%
> 1649943910.454 2243384 0.13596728869302646%
> 2064977308.315 2517573 0.12191770775700743%
> 1962826877.902 2403546 0.12245328546596376%
> 1991037543.226 3214700 0.16145853255942866%
> 1469646744.644 1960481 0.133398111290676%
> 2195532272.023 1693642 0.077140382839349028%
> 1902754342.931 2542547 0.13362455376575119%
> 1922726134.774 2726990 0.14182935108023245%
> 1629134454.130 2814497 0.17276026499009955%
> 1442857977.322 2886065 0.2000241912483062%
> 1782430811.270 2884583 0.16183421997427802%
> 1574026949.337 2873023 0.18252692568003068%
> 2064769936.712 2717133 0.13159495165484839%
> 1621381781.294 2143854 0.13222388611576869%
> 1783777994.311 2515565 0.14102455619605617%
> 1561198792.593 2402227 0.15387066729728466%
> 1658682002.165 2924639 0.17632306832669586%
> 1675144757.604 2216513 0.13231769910860314%
> 1712859849.041 2274046 0.13276310967725691%
> 1448178253.374 2507727 0.17316424923226392%
> 1621787692.726 2822681 0.17404750403892047%
> 1924315582.555 2275728 0.11826168330344308%
> 2044659318.750 2606738 0.12749008972280165%
> 1559320878.499 2049522 0.13143683434629996%
> 1641314354.590 2786145 0.16975084585158451%
> 1769603624.855 2008994 0.11352790940200608%
> 1707747444.550 2813289 0.16473682973311757%
> 1366611645.499 2930513 0.21443641356721807%
> 1435165128.768 3161655 0.22029903992400404%
> 1440555681.860 2831258 0.19653929630435177%
> 1627011278.312 1956565 0.12025515901954331%
> 1668464384.654 3863649 0.23156916237089648%
> 1589045506.115 2743558 0.17265446391825656%
> 1766150748.153 3207045 0.18158387687765917%
> 1668645748.936 2997211 0.17961937109246526%
> 1808149919.182 2420007 0.13383884678626659%
> 1831096162.185 2226869 0.12161398434382248%
> 1473136553.216 2774496 0.18833936296964501%
> 1873520917.567 2427200 0.12955286366121929%
> 1971938578.205 2207736 0.11195764535473715%
> 1670588149.850 2437810 0.14592525394238479%
> 1912140400.200 2407058 0.12588291109524352%
> 1925513187.452 2415130 0.12542786077699639%
> 1658671061.956 3111210 0.18757245311382492%
> 1917423750.996 2888924 0.15066695603928745%
> 1553962205.772 3386946 0.21795549386076499%
> 1834101490.364 2367281 0.12907033838842713%
> 1626846627.678 3362957 0.20671629044711806%
> 1519340210.104 2320735 0.15274623712098975%
> 1725615314.262 3813557 0.22099693764197714%
> 1725823077.103 3137784 0.1818137699993643%
> 1633124107.941 2007344 0.12291435722731486%
>
> Thanks,
> Fengguang

2009-09-26 11:48:14

by Fengguang Wu

Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Sat, Sep 26, 2009 at 07:09:21PM +0800, Hugh Dickins wrote:
> On Sat, 26 Sep 2009, Wu Fengguang wrote:
>
> > The swap cache and page cache code assume that they 'own' the newly
> > allocated page and therefore can disregard the locking rules. However
> > now hwpoison can hit any time on any page.
> >
> > So use the safer lock_page()/trylock_page(). The main intention is not
> > to close such a small time window of memory corruption. But to avoid
> > kernel oops that may result from such races, and also avoid raising
> > false alerts in hwpoison stress tests.
> >
> > This in theory will slightly increase page cache/swap cache overheads,
> > however it seems to be too small to be measurable in benchmark.
>
> No.
>
> But I'd most certainly defer to Nick if he disagrees with me.
>
> I don't think anyone would want to quarrel very long over the swap
> and migration mods alone, but add_to_page_cache() is of a higher
> order of magnitude.

Yup, add_to_page_cache() is a hot path.

> I can't see any reason to surrender add_to_page_cache() optimizations
> to the remote possibility of hwpoison (infinitely remote for most of
> us); though I wouldn't myself want to run the benchmark to defend them.
>
> You'd later be sending a patch to replace __SetPageUptodate()s by
> SetPageUptodate()s etc, wouldn't you? Because any non-atomic op
> on page->flags might wipe your locked bit (or your hwpoison bit).

That's a sad fact: there may be more holes than we're willing or able to handle...

> You could make #ifdef CONFIG_HWPOISON_INJECT select the slower
> versions of these things; or #ifdef CONFIG_MEMORY_FAILURE? that
> would pose distros with a harder choice. But I'd much prefer
> not to go that way.
>
> Please accept that there will be quite a number of places where
> the code "knows" it's the only user of the page, and hwpoison
> handling and testing should work around those places (can shift
> things around slightly to suit itself better, but not add cost).
>
> Look into why you think you want the page lock: I can see it's
> going to be useful if you're taking a page out of a file (but then
> why bother if page->mapping not set?), or if you're using rmap-style
> lookup (but then why bother if !page_mapped?).
>
> I suspect if memory_failure() did something like:
> if (page->mapping)
> lock_page_nosync(p);
> then you'd be okay, perhaps with a few additional _inexpensive_
> tweaks here and there. With the "necessary" memory barriers?
> no, we probably wouldn't want to be adding any in hot paths.
>
> But I definitely say "something like": remember that page_mapping()
> does that weird thing with PageSwapCache (a mistake from day one in
> my opinion), which might or might not be what you want. There are
> probably various reasons why it's not as simple as I suggest above.

Good viewpoint! Could do it if it turns out to be simple.

However we may well end up to accept the fact that "we just cannot do
hwpoison 100% correct", and settle with a simple and 99% correct code.

> It seems to me that the Intel hardware guys have done half a job
> here: the sooner they get to remapping the bad pages, the better.

When we can afford to set aside half the memory :)

Thanks,
Fengguang

> > CC: Hugh Dickins <[email protected]>
> > CC: Andi Kleen <[email protected]>
> > Signed-off-by: Wu Fengguang <[email protected]>
> > ---
> > include/linux/pagemap.h | 13 ++++---------
> > mm/migrate.c | 2 +-
> > mm/swap_state.c | 4 ++--
> > 3 files changed, 7 insertions(+), 12 deletions(-)
> >
> > --- sound-2.6.orig/mm/swap_state.c 2009-09-14 10:50:19.000000000 +0800
> > +++ sound-2.6/mm/swap_state.c 2009-09-25 18:42:23.000000000 +0800
> > @@ -306,7 +306,7 @@ struct page *read_swap_cache_async(swp_e
> > * re-using the just freed swap entry for an existing page.
> > * May fail (-ENOMEM) if radix-tree node allocation failed.
> > */
> > - __set_page_locked(new_page);
> > + lock_page(new_page);
> > SetPageSwapBacked(new_page);
> > err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> > if (likely(!err)) {
> > @@ -318,7 +318,7 @@ struct page *read_swap_cache_async(swp_e
> > return new_page;
> > }
> > ClearPageSwapBacked(new_page);
> > - __clear_page_locked(new_page);
> > + unlock_page(new_page);
> > swapcache_free(entry, NULL);
> > } while (err != -ENOMEM);
> >
> > --- sound-2.6.orig/include/linux/pagemap.h 2009-09-14 10:50:19.000000000 +0800
> > +++ sound-2.6/include/linux/pagemap.h 2009-09-25 18:42:19.000000000 +0800
> > @@ -292,11 +292,6 @@ extern int __lock_page_killable(struct p
> > extern void __lock_page_nosync(struct page *page);
> > extern void unlock_page(struct page *page);
> >
> > -static inline void __set_page_locked(struct page *page)
> > -{
> > - __set_bit(PG_locked, &page->flags);
> > -}
> > -
> > static inline void __clear_page_locked(struct page *page)
> > {
> > __clear_bit(PG_locked, &page->flags);
> > @@ -435,18 +430,18 @@ extern void remove_from_page_cache(struc
> > extern void __remove_from_page_cache(struct page *page);
> >
> > /*
> > - * Like add_to_page_cache_locked, but used to add newly allocated pages:
> > - * the page is new, so we can just run __set_page_locked() against it.
> > + * Like add_to_page_cache_locked, but used to add newly allocated pages.
> > */
> > static inline int add_to_page_cache(struct page *page,
> > struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask)
> > {
> > int error;
> >
> > - __set_page_locked(page);
> > + if (!trylock_page(page))
> > + return -EIO; /* hwpoisoned */
> > error = add_to_page_cache_locked(page, mapping, offset, gfp_mask);
> > if (unlikely(error))
> > - __clear_page_locked(page);
> > + unlock_page(page);
> > return error;
> > }
> >
> > --- sound-2.6.orig/mm/migrate.c 2009-09-14 10:50:19.000000000 +0800
> > +++ sound-2.6/mm/migrate.c 2009-09-25 18:42:19.000000000 +0800
> > @@ -551,7 +551,7 @@ static int move_to_new_page(struct page
> > * holding a reference to the new page at this point.
> > */
> > if (!trylock_page(newpage))
> > - BUG();
> > + return -EAGAIN; /* got by hwpoison */
> >
> > /* Prepare mapping for the new page.*/
> > newpage->index = page->index;

2009-09-26 11:58:54

by Hugh Dickins

Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Sat, 26 Sep 2009, Wu Fengguang wrote:
>
> However we may well end up to accept the fact that "we just cannot do
> hwpoison 100% correct", and settle with a simple and 99% correct code.

I thought we already accepted that: you cannot do hwpoison 100% correct
(or not without a radical rewrite of the OS); you're just looking to
get it right almost all the time on those systems which have almost
all their pages in pagecache or userspace or free.

Hugh

2009-09-26 15:05:56

by Andi Kleen

Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

> However we may well end up to accept the fact that "we just cannot do
> hwpoison 100% correct", and settle with a simple and 99% correct code.

I would prefer to avoid any oopses, but if they are unlikely enough
and too hard to fix that's bearable. The race window here is certainly rather
small.

On the other hand if you cannot detect a difference in benchmarks I see
no reason not to add the additional steps, as long as the code isn't
complicated or ugly. These changes are neither.

-Andi
--
[email protected] -- Speaking for myself only.

2009-09-26 19:06:46

by Nick Piggin

Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Sat, Sep 26, 2009 at 12:09:21PM +0100, Hugh Dickins wrote:
> On Sat, 26 Sep 2009, Wu Fengguang wrote:
>
> > The swap cache and page cache code assume that they 'own' the newly
> > allocated page and therefore can disregard the locking rules. However
> > now hwpoison can hit any time on any page.
> >
> > So use the safer lock_page()/trylock_page(). The main intention is not
> > to close such a small time window of memory corruption. But to avoid
> > kernel oops that may result from such races, and also avoid raising
> > false alerts in hwpoison stress tests.
> >
> > This in theory will slightly increase page cache/swap cache overheads,
> > however it seems to be too small to be measurable in benchmark.
>
> No.
>
> But I'd most certainly defer to Nick if he disagrees with me.
>
> I don't think anyone would want to quarrel very long over the swap
> and migration mods alone, but add_to_page_cache() is of a higher
> order of magnitude.
>
> I can't see any reason to surrender add_to_page_cache() optimizations
> to the remote possibility of hwpoison (infinitely remote for most of
> us); though I wouldn't myself want to run the benchmark to defend them.

Thanks Hugh, I definitely agree with you. And I agree the page lock
is a strange thing: it's only really well defined for some types of
pages (eg. pagecache pages), so it's not clear what it even really
means to take the page lock on non-pagecache or soon-to-be-pagecache
pages.

We don't need to run benchmarks: it's unquestionably slower if we
have to go back to full atomic ops here. What we need to focus on
throughout the kernel is reducing atomic ops and unpredictable
branches rather than adding them, because our fastpaths are getting
monotonically slower *anyway*.

I would much prefer that hwpoison code first ensures it has a valid
pagecache page and is pinning it before it ever tries to do a
lock_page.

This is a bit tricky to do right now; you have a chicken and egg
problem between locking the page and pinning the inode mapping.
Possibly you could get the page ref, then check mapping != NULL,
and in that case lock the page. You'd just have to check ordering
on the other side... and if you do something crazy like that,
then please add comments in the core code saying that hwpoison
has added a particular dependency.


> I suspect if memory_failure() did something like:
> 	if (page->mapping)
> 		lock_page_nosync(p);
> then you'd be okay, perhaps with a few additional _inexpensive_
> tweaks here and there. With the "necessary" memory barriers?
> no, we probably wouldn't want to be adding any in hot paths.

Ah, you came to the same idea. Yes memory barriers in the fastpath
are no good, but you can effectively have a memory barrier on
all other CPUs by doing a synchronize_rcu() with no cost to the
fastpaths.

2009-09-26 19:12:01

by Nick Piggin

Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Sat, Sep 26, 2009 at 05:05:55PM +0200, Andi Kleen wrote:
> > However we may well end up accepting the fact that "we just cannot do
> > hwpoison 100% correct", and settle for a simple and 99%-correct code.
>
> I would prefer to avoid any oopses, but if they are unlikely enough
> and too hard to fix that's bearable. The race window here is certainly rather
> small.

Well, several places non-atomically modify page flags, including
within preempt-enabled regions... It's nasty to introduce these
oopses in the hwpoison code! I'm ashamed I didn't pick up on this
problem seeing as I introduced several of them.


> On the other hand if you cannot detect a difference in benchmarks I see
> no reason not to add the additional steps, as long as the code isn't
> complicated or ugly. These changes are neither.

The patch to add atomics back into the fastpaths? I don't think that's
acceptable at all. A config option doesn't go far enough either because
distros will have to turn it on.

2009-09-26 19:14:52

by Nick Piggin

Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Sat, Sep 26, 2009 at 07:48:06PM +0800, Wu Fengguang wrote:
> On Sat, Sep 26, 2009 at 07:09:21PM +0800, Hugh Dickins wrote:
> > It seems to me that the Intel hardware guys have done half a job
> > here: the sooner they get to remapping the bad pages, the better.
>
> When we can afford to set aside half of memory :)

Maybe even strengthening the ECC codes so that uncorrected errors
are reduced to a frequency similar to other sources of errors in
the hardware. That seems like the sanest thing to me, but it's
not mine to wonder...

2009-09-26 21:32:07

by Andi Kleen

Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

> This is a bit tricky to do right now; you have a chicken and egg
> problem between locking the page and pinning the inode mapping.

One possibly simple solution would be to just allocate the page
locked (GFP_LOCKED). When the allocator clears the flags it already
modifies the state, so it could as well set the lock bit too. No
atomics needed. And then clearing it later is also atomic free.

Would that satisfy the concerns?

Again another way is to just ignore it.

-Andi

2009-09-27 10:47:51

by Fengguang Wu

Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Sat, Sep 26, 2009 at 07:31:56PM +0800, Wu Fengguang wrote:
> On Sat, Sep 26, 2009 at 06:52:59PM +0800, Wu Fengguang wrote:
> > On Sat, Sep 26, 2009 at 11:49:36AM +0800, Andi Kleen wrote:
> > > On Sat, Sep 26, 2009 at 11:15:37AM +0800, Wu Fengguang wrote:
> > > > The swap cache and page cache code assume that they 'own' the newly
> > > > allocated page and therefore can disregard the locking rules. However
> > > > now hwpoison can hit any time on any page.
> > > >
> > > > So use the safer lock_page()/trylock_page(). The main intention is not
> > > > to close such a small time window of memory corruption. But to avoid
> > > > kernel oops that may result from such races, and also avoid raising
> > > > false alerts in hwpoison stress tests.
> > > >
> > > > This in theory will slightly increase page cache/swap cache overheads,
> > > > however it seems to be too small to be measurable in benchmark.
> > >
> > > Thanks. Can you please describe what benchmarks you used?
> >
> > Good question. The test case is to copy 1GB sparse file over NFS.
> >
> > I ran it 100 times (2 hours total wall time) and the overheads are ~0.008%.
>
> And standard deviation is 0.04%, much larger than the difference 0.008% ..

Sorry, that's not correct. I improved the accounting by treating
function0+function1 from the two CPUs as a single entity:

total time add_to_page_cache_lru percent stddev
before 3880166848.722 9683329.610 0.250% 0.014%
after 3828516894.376 9778088.870 0.256% 0.012%
delta 0.006%

Thanks,
Fengguang


> > The average time reported by the function profiler:
> >
> > total time add_to_page_cache_lru percent
> > before 1940083424.361 4841664.805 0.242%
> > after 1914258447.188 4889044.435 0.250%
> >
> > The script used is:
> >
> > for i in `seq 1 100`
> > do
> > # copy 1GB sparse file over NFS
> > echo 1 > /debug/tracing/function_profile_enabled
> > cp /tmp/sparse/1G /dev/null
> > echo 0 > /debug/tracing/function_profile_enabled
> >
> > fadvise /tmp/sparse/1G 0 0 dontneed
> > cat /debug/tracing/trace_stat/function0 > function0.$i
> > cat /debug/tracing/trace_stat/function1 > function1.$i
> > done
> >
> >
> > File function0.1 after patch:
> >
> > Function Hit Time Avg
> > -------- --- ---- ---
> > schedule 32492 474358128 us 14599.22 us
> > schedule_timeout 26 99994288 us 3845934 us
> > poll_schedule_timeout 8 52914376 us 6614297 us
> > sys_poll 6 52914373 us 8819062 us
> > schedule_hrtimeout_range 8 52914373 us 6614296 us
> > do_sys_poll 6 52914355 us 8819059 us
> > sys_read 86997 40703194 us 467.868 us
> > vfs_read 87002 40550651 us 466.088 us
> > do_sync_read 86987 40011054 us 459.965 us
> > nfs_file_read 86985 39945631 us 459.224 us
> > generic_file_aio_read 86985 39785858 us 457.387 us
> > page_cache_async_readahead 2886 34379069 us 11912.35 us
> > ondemand_readahead 2889 34371850 us 11897.49 us
> > ra_submit 2891 34368378 us 11888.05 us
> > __do_page_cache_readahead 2891 34360744 us 11885.41 us
> > nfs_readpages 2876 33198107 us 11543.15 us
> > read_cache_pages 2876 31617284 us 10993.49 us
> > __rpc_execute 84468 25404441 us 300.758 us
> > readpage_async_filler 174071 24947004 us 143.315 us
> > do_softirq 44822 24829675 us 553.961 us
> > __do_softirq 44822 24803408 us 553.375 us
> > net_rx_action 38424 24357207 us 633.906 us
> > e1000_clean 39042 24253141 us 621.206 us
> > nfs_pageio_doio 43534 22993433 us 528.171 us
> > nfs_pagein_one 43520 22939752 us 527.108 us
> > nfs_read_rpcsetup 43513 22533176 us 517.849 us
> > e1000_clean_rx_irq 39042 22047306 us 564.707 us
> > rpc_run_task 43528 22042766 us 506.404 us
> > call_transmit 45873 21842655 us 476.154 us
> > nfs_pageio_add_request 174091 21583666 us 123.979 us
> > rpc_execute 43528 21254402 us 488.292 us
> > cpuidle_idle_call 18073 19445493 us 1075.941 us
> > acpi_idle_enter_c1 17999 19396945 us 1077.667 us
> > e1000_receive_skb 436279 18818686 us 43.134 us
> > tcp_v4_do_rcv 513521 18681598 us 36.379 us
> > xprt_transmit 43346 18532249 us 427.542 us
> > tcp_rcv_established 513521 18384627 us 35.801 us
> > napi_gro_receive 436279 18114586 us 41.520 us
> > napi_skb_finish 436279 17130261 us 39.264 us
> > netif_receive_skb 436279 16886046 us 38.704 us
> > ip_rcv 436279 16500338 us 37.820 us
> > xs_tcp_send_request 43346 16418581 us 378.779 us
> > xs_sendpages 43346 16367619 us 377.603 us
> > xs_send_kvec 43347 16310325 us 376.273 us
> > kernel_sendmsg 43347 16254698 us 374.990 us
> > sock_sendmsg 43347 16177184 us 373.201 us
> > tcp_sendmsg 43347 16127746 us 372.061 us
> > do_IRQ 45271 15968995 us 352.742 us
> > ip_local_deliver 436279 15816133 us 36.252 us
> > irq_exit 53709 15421475 us 287.130 us
> > acpi_processor_ffh_cstate_ente 17999 15394735 us 855.310 us
> > mwait_idle_with_hints 17999 15386089 us 854.830 us
> > tcp_v4_rcv 436217 15173852 us 34.785 us
> > release_sock 43364 13536313 us 312.155 us
> > xs_tcp_data_ready 512045 10092920 us 19.711 us
> > _spin_unlock_bh 518900 9851221 us 18.984 us
> > local_bh_enable_ip 511352 9403704 us 18.389 us
> > tcp_read_sock 1024090 8832051 us 8.624 us
> > __tcp_ack_snd_check 512045 6644838 us 12.977 us
> > add_to_page_cache_lru 174072 6287717 us 36.121 us
> > tcp_transmit_skb 296572 5962431 us 20.104 us
> > tcp_send_ack 253210 5775135 us 22.807 us
> > __kfree_skb 802932 5216451 us 6.496 us
> > add_to_page_cache_locked 174072 4716425 us 27.094 us
> > [...]
> >
> > Detailed time before patch (100 runs * 2 CPU):
> >
> > Total add_to_page_cache_lru percent
> > 1927694859.404 6703561 0.34775011030908648%
> > 1994166534.446 7311783 0.36665859514242066%
> > 2132812997.930 6249683 0.29302536162643533%
> > 1972582396.980 7908593 0.40092586307715006%
> > 2415371209.622 7670327 0.31756307144194151%
> > 2163473353.153 7406991 0.34236571433640306%
> > 1833544521.635 7952113 0.43370165851816772%
> > 2087748556.683 7995535 0.38297404035585819%
> > 2185868844.795 7117368 0.32560819085499598%
> > 1980423680.193 6514699 0.32895481230385598%
> > 1958021484.352 6745228 0.34449203207963308%
> > 2008592063.702 7398967 0.36836583862444905%
> > 2448117568.931 6224454 0.25425470079519025%
> > 2213021186.789 6938228 0.31351837214297412%
> > 1828902235.789 7479842 0.40897987074596959%
> > 1866644864.837 7720792 0.41361868802366963%
> > 2054526163.572 8475856 0.41254553727677401%
> > 1902862587.026 8334661 0.43800645705196778%
> > 2148449373.925 8672498 0.40366313050031155%
> > 2104880443.231 7741619 0.36779376353160392%
> > 2270646549.540 6486688 0.2856758134071597%
> > 2250814274.826 7512331 0.33376058984612328%
> > 2214008898.106 6391259 0.28867359139646992%
> > 2446226035.589 7738434 0.31634173978271579%
> > 1920377000.358 6464992 0.33665223020244384%
> > 2030446733.714 6022617 0.2966153654759367%
> > 2157128575.464 7375670 0.34192074055731642%
> > 2185670134.683 7943199 0.36342167438509865%
> > 2224381465.969 8492920 0.38181041021667828%
> > 2479045291.455 8161934 0.32923698603382928%
> > 1971641972.432 8114434 0.41155717485517562%
> > 2208169216.013 8138519 0.36856410011433138%
> > 1971312194.985 8122576 0.41203904793283169%
> > 2279129743.345 7365286 0.3231622079219687%
> > 2360162707.877 7412468 0.3140659741491984%
> > 1885022652.013 8252339 0.43778460652382062%
> > 2065345722.062 7985361 0.3866355600759942%
> > 1975982196.916 7925371 0.4010851419799969%
> > 2169738273.075 7203856 0.33201497569522709%
> > 2034570957.887 7396992 0.36356520136717829%
> > 1789538123.758 7380292 0.41241323121418116%
> > 1912812047.546 8697404 0.45469203370807615%
> > 2221047646.893 7451321 0.33548676951723994%
> > 2163981380.559 7982481 0.36887937538251664%
> > 1969189605.396 7353112 0.37340802428831149%
> > 2278909389.270 7088803 0.31106120468750831%
> > 2104996788.904 7342692 0.34882200479854836%
> > 2374705789.293 7327329 0.30855733931492624%
> > 2137944282.456 8112283 0.3794431438915179%
> > 2268117031.230 6889022 0.30373309247909852%
> > 1899091967.945 8173918 0.43041190937398177%
> > 2098779842.987 7858621 0.37443760603377763%
> > 2124662819.215 8156209 0.38388251190904155%
> > 2317714239.959 7048899 0.30413149638864423%
> > 2218944220.453 6659501 0.30012025262358577%
> > 1945126557.801 7704643 0.3960998305791596%
> > 2092060468.851 7616315 0.36405807161888715%
> > 1933379809.531 7300893 0.37762331870896354%
> > 1903832170.769 7508856 0.39440745435912067%
> > 2435478683.656 6790800 0.27882814354203433%
> > 2126189870.763 7104836 0.33415811530747125%
> > 2161363441.772 7011298 0.32439236569356278%
> > 2064449342.765 6822211 0.33046153561039859%
> > 2116816508.947 8156178 0.38530396779913867%
> > 1982774633.856 7131335 0.35966442571092105%
> > 2323253766.182 7275361 0.3131539526978242%
> > 1905927501.820 7477458 0.39232646534874271%
> > 2252162684.001 8064426 0.35807475442552977%
> > 2027146552.281 7875866 0.38851981328818402%
> > 1845397260.011 7598767 0.41176862915439172%
> > 2177206310.501 7366288 0.33833670077434846%
> > 2041629016.560 6600368 0.32328929234759562%
> > 2189308825.112 8348863 0.38134697600613249%
> > 1930990433.231 7344038 0.38032492930126544%
> > 2102320863.469 7299411 0.34720727586536809%
> > 1895798071.683 8051595 0.42470741585110772%
> > 2019648719.623 7992832 0.39575357448754711%
> > 1961118039.267 7837733 0.39965636147681766%
> > 2142238585.905 8084509 0.37738602288244455%
> > 2263212432.641 7757018 0.34274369865263332%
> > 2200111036.549 7069702 0.32133387281622078%
> > 1941699880.568 7786856 0.4010329339734075%
> > 1939777157.562 7729686 0.39848319534368681%
> > 1983015138.927 8136252 0.41029701893261827%
> > 2040218650.019 7968023 0.39054750332400873%
> > 1850721920.862 7835912 0.42339758943095634%
> > 2167986192.252 8656589 0.39929170356052646%
> > 2128868496.040 8248906 0.38747841942065214%
> > 1984840247.084 7754853 0.39070413910605312%
> > 2130780080.168 7053725 0.33103955990821227%
> > 2062729676.752 7968701 0.38631824081513272%
> > 1995756518.448 7903687 0.39602461156666058%
> > 1921150053.008 8307746 0.43243608103345821%
> > 2253731991.230 8033382 0.35644797301810893%
> > 2143807312.838 6755593 0.31512127790332323%
> > 1948787937.040 7152297 0.36701258582622248%
> > 2276500726.783 7277806 0.31969267193182554%
> > 2282733838.581 6336839 0.27759867983291148%
> > 2003104438.909 7092547 0.35407774363791977%
> > 2239684291.880 6982252 0.3117516171950766%
> > 1801833773.742 2645976 0.14684906224756283%
> > 2016683111.150 2261656 0.11214731692329717%
> > 1962727948.601 2767184 0.14098663046870061%
> > 1845203884.375 2266317 0.1228220371304734%
> > 1516914718.568 1975612 0.13023883121557553%
> > 1557730294.184 2109274 0.13540688063108641%
> > 2171303320.533 1865384 0.085910797554627022%
> > 1617322895.641 2073254 0.12819048104666192%
> > 1867107731.934 1901409 0.10183713384500154%
> > 1772588329.145 2045529 0.11539786008783279%
> > 2030802387.562 2161163 0.10641916777508319%
> > 1853583237.019 1803299 0.097287187539531861%
> > 1453728764.040 3231016 0.22225714176699726%
> > 1989811798.738 2418414 0.1215398361560542%
> > 1956758511.567 2254818 0.11523230826241863%
> > 2001579398.410 1952360 0.097540971971978793%
> > 1636515132.388 1532408 0.093638486419854422%
> > 1926271619.408 1918912 0.09961793449408439%
> > 1642519133.238 1384193 0.084272564744573439%
> > 1677033195.955 1340416 0.07992781557533149%
> > 1693983317.244 3135804 0.18511421972571418%
> > 1487434932.071 2467493 0.1658891388656871%
> > 1720442169.343 3034092 0.17635536108479918%
> > 1640362983.552 1944463 0.11853858075908964%
> > 1821382248.526 2659828 0.1460334864991977%
> > 1947322513.836 3540608 0.18181929160904178%
> > 1673422299.278 2769832 0.16551900863249205%
> > 1836039325.613 2175872 0.11850900847526999%
> > 1582929718.539 2072026 0.13089816785500888%
> > 1476014468.174 1996943 0.13529291501257767%
> > 1847131237.607 2120975 0.11482535495137695%
> > 1734796451.802 1842211 0.10619176665287874%
> > 1804827020.363 1988047 0.11015166426310204%
> > 1647250758.752 2634152 0.15991202225917941%
> > 1506833507.228 2271966 0.15077750720977479%
> > 1919459367.994 1608374 0.083793073550751385%
> > 1913364635.281 2126200 0.11112361756846952%
> > 1733934046.047 2031440 0.11715785872197679%
> > 1762796865.803 2285487 0.12965118354456009%
> > 1711072016.023 2270996 0.13272357789349024%
> > 2082560830.257 2273483 0.10916766353083858%
> > 1857948273.729 1266725 0.06817870109255604%
> > 1738287034.188 2404600 0.1383315846409251%
> > 1618942973.292 2016207 0.12453848179100423%
> > 1820789056.854 2369799 0.13015230902664757%
> > 1561747605.669 2487149 0.15925422206327566%
> > 1884502059.843 2512148 0.13330566485076115%
> > 1352203145.315 2618029 0.19361210695824271%
> > 1911003204.592 1648292 0.086252707271200574%
> > 1525244016.511 2555354 0.16753738892517536%
> > 2190575134.148 1939935 0.088558249829422828%
> > 1643983673.168 1747429 0.10629235730989092%
> > 1781900292.843 1686195 0.094629032094141294%
> > 1566821289.746 2291057 0.14622324926229507%
> > 1777199439.264 2456866 0.13824368530171682%
> > 1633063412.403 1953080 0.11959609070698031%
> > 1913601422.874 1821650 0.095194849785599556%
> > 2094732749.709 2322465 0.11087166132876074%
> > 1951529182.943 2270128 0.11632559839953495%
> > 1450396654.276 2867633 0.19771370759479912%
> > 1643954213.946 2460211 0.14965203891504558%
> > 1845971802.604 2593942 0.14051904781757141%
> > 1671974226.831 2552423 0.15265923116756236%
> > 1792483105.554 1547320 0.086322710390164162%
> > 1774273751.516 2101959 0.1184686973024323%
> > 1797523443.598 1781759 0.099122990932099045%
> > 1877038808.523 2131598 0.11356174365288201%
> > 1485614991.186 1796722 0.12094129439052281%
> > 2008042349.453 1972441 0.09822706182154485%
> > 2178564679.489 1948988 0.089462021410222761%
> > 1462911389.586 2076158 0.14191960051575969%
> > 1976013519.965 2692701 0.13626936115536772%
> > 1473561005.083 1369873 0.092963439944099252%
> > 2190713955.687 2450228 0.11184609444967988%
> > 1648795783.114 2557892 0.15513698095279177%
> > 2116292380.702 1948901 0.092090347145394241%
> > 1704227210.803 1980402 0.11620527987385389%
> > 1984573918.515 2325530 0.11718031655581404%
> > 1856414008.018 1798348 0.096872141248277158%
> > 1401593348.875 2285205 0.16304336788086488%
> > 1832475317.969 2373585 0.12952889333487622%
> > 1840826298.949 2112188 0.11474129857911804%
> > 2014475390.352 2104652 0.10447643143618861%
> > 1730176303.748 1975645 0.11418749613667999%
> > 2016553863.650 2009045 0.099627638825555156%
> > 1891795883.646 2160557 0.11420666567029553%
> > 1881160108.475 1492908 0.079361027978115894%
> > 1562754135.542 1978972 0.12663361145505117%
> > 2061996559.038 2071212 0.10044691834821971%
> > 1519528906.706 2460493 0.16192472477103448%
> > 1796244525.568 2099254 0.1168690548596764%
> > 1998463642.677 1988093 0.099481069234609226%
> > 1872169708.204 1762657 0.094150492462082555%
> > 1664425938.102 1710870 0.10279039522485223%
> > 1791543540.761 1992110 0.11119517637588673%
> > 1883913354.514 1770358 0.093972368514618113%
> > 1813806740.237 1877185 0.10349421238531283%
> > 1547586337.236 2702003 0.17459465329900728%
> > 2020313568.921 2100334 0.10396079263684484%
> > 1727289498.314 2059429 0.1192289423405978%
> >
> > Detailed time after patch:
> >
> > Total add_to_page_cache_lru percent
> > 2151489401.227 6287717 0.29224949918015392%
> > 1919030541.464 6872841 0.3581413037208262%
> > 2277104485.950 8538296 0.37496285535785828%
> > 2142847301.255 7992466 0.37298345968558083%
> > 1909650847.324 6960671 0.36449966808089612%
> > 2109123648.836 6825371 0.32361170497361974%
> > 2128503623.032 7728883 0.36311345286743746%
> > 1925232499.136 7526064 0.39091714914315673%
> > 1935780958.829 7499162 0.38739723964101919%
> > 1722848267.832 7353451 0.42681942091473007%
> > 1855159456.452 7600681 0.40970499724785564%
> > 1862769335.900 8607712 0.46209221045831578%
> > 2272365556.764 6895008 0.30342864419310028%
> > 2049429274.204 7252089 0.35385895435775488%
> > 2001218334.232 8155639 0.4075336938750293%
> > 1947131845.644 7229822 0.37130623774523025%
> > 1982915408.746 7513518 0.37891268416495716%
> > 2038654062.353 7531559 0.36943781385386831%
> > 2134879530.359 6738727 0.31564905205057731%
> > 2029761134.284 7472398 0.36814174208907074%
> > 1956557094.935 7280443 0.37210480690019776%
> > 1740007857.373 6663035 0.38293131676196024%
> > 2054278081.418 7765213 0.37800203732105886%
> > 2367585922.353 7629987 0.32226864199365651%
> > 2023763123.141 7730698 0.38199618876350999%
> > 1826074987.177 7579606 0.41507638258150648%
> > 2077873224.783 7414675 0.35683962387909118%
> > 2103719398.720 7093390 0.33718327664402131%
> > 2164758918.301 7181094 0.33172719323572736%
> > 1953199921.042 7750654 0.39681826302067191%
> > 2179298642.359 6395109 0.29344803303679207%
> > 1971922836.328 7215479 0.36591081897688477%
> > 2225552712.785 8153246 0.36634701812105075%
> > 2124780733.448 7080842 0.33325048032177529%
> > 2139511762.563 8708060 0.40701155059639843%
> > 1934365642.348 6842114 0.35371358186939283%
> > 1950765025.910 6818753 0.34954250816646476%
> > 2013188877.577 6495929 0.32266863146086228%
> > 1921391492.273 6536576 0.34020011154870117%
> > 2164698840.945 6712421 0.31008567441509277%
> > 1992440404.239 7467130 0.3747730664422067%
> > 2073623249.374 7113197 0.34303227465005426%
> > 1938481781.041 6902169 0.35606055561138211%
> > 1963963682.421 7776006 0.39593430721765843%
> > 1807704433.895 7206899 0.39867684477995868%
> > 1902413226.299 7687850 0.40411041585093088%
> > 1718395340.661 7096452 0.41296969516166576%
> > 2014139552.830 7653595 0.37999328245384933%
> > 1976902280.499 7781174 0.3936043818026207%
> > 1920129734.441 6808420 0.3545812492707483%
> > 1786139514.638 7179970 0.4019825966089316%
> > 2011990483.317 6739406 0.33496212113733792%
> > 2186643920.919 7830962 0.35812698743876015%
> > 1822975832.465 8627042 0.47323951565143058%
> > 1872249428.425 7264440 0.38800599373722838%
> > 1971668701.573 6818269 0.34581210294408871%
> > 2054585589.388 7261214 0.35341501651254642%
> > 2252858576.565 7228902 0.32087686618225808%
> > 2196771348.820 6634658 0.30201859668116204%
> > 2224872312.502 6948714 0.31231967609798522%
> > 1924947573.139 6935037 0.360271474235066%
> > 2070207954.999 7564561 0.36540102078796494%
> > 2070702822.858 6994172 0.33776802362912656%
> > 2099826181.510 7199665 0.34286957003377627%
> > 2369037401.585 7009501 0.29587971027009985%
> > 2109093017.803 7056062 0.33455432929886414%
> > 2141046841.968 7831694 0.36578807368835009%
> > 2284785719.942 7358167 0.3220506385249462%
> > 2088247084.213 6857813 0.32840045853982414%
> > 1880913709.753 6697468 0.35607523967059107%
> > 1984586172.943 7361786 0.37094816543455983%
> > 2130320004.839 7840067 0.36802297223850727%
> > 2314059227.236 6669054 0.28819720435443524%
> > 1935408704.712 8309322 0.42933164348025787%
> > 2239440115.241 7040432 0.31438357972087749%
> > 2387500130.769 6555638 0.27458168129559291%
> > 2497500671.323 6969731 0.27906823329532587%
> > 2229438175.095 6759889 0.30321042653322067%
> > 1979317661.013 7870099 0.39761677243723181%
> > 2228213884.202 5883791 0.26405862748257614%
> > 2264526373.743 6563122 0.28982316461838858%
> > 2256376049.198 6670128 0.29561242694323103%
> > 2085817937.654 6878115 0.32975624937505721%
> > 2177225113.484 7421724 0.34087995559282069%
> > 1843473456.551 7156295 0.38819625932607071%
> > 2403022978.094 6958096 0.28955594946157931%
> > 1935191866.634 6872608 0.35513832599730566%
> > 1917810458.850 7816937 0.40759695328220108%
> > 2016838266.381 7257719 0.35985627211562177%
> > 1904575178.166 7484129 0.39295529448235278%
> > 2087232238.930 7372521 0.35321996577532039%
> > 1976631116.176 6438233 0.32571747693901715%
> > 2004191806.805 6735567 0.33607397141980955%
> > 2110550986.328 7174599 0.33993961986592813%
> > 2152621184.889 7337361 0.34085704681840484%
> > 2059135774.936 6363510 0.30903790208772336%
> > 2353978555.486 7265456 0.30864580236160993%
> > 1945677192.325 5927833 0.30466682877217138%
> > 2255676557.335 6379740 0.28283044301074045%
> > 2034627875.756 7319043 0.35972391252530572%
> > 1712556465.847 3079747 0.17983331127577226%
> > 1837380343.831 3111263 0.1693314620702272%
> > 1680363376.083 1611968 0.095929727042585719%
> > 1769978476.954 2017096 0.11396161175198416%
> > 1829700109.543 2612874 0.14280340184559598%
> > 1783959966.875 3184558 0.17851062014460206%
> > 1772032973.967 2133741 0.12041203698502599%
> > 1667688494.152 1985120 0.11903422053705601%
> > 2001252356.720 2633874 0.13161128785961057%
> > 2077009766.270 2858188 0.13761071548223289%
> > 2076612330.379 2340135 0.11269002720276176%
> > 1834812072.536 1713713 0.093399919569494547%
> > 1787949752.795 2522212 0.14106727529994451%
> > 1907464219.876 2476303 0.12982172741153589%
> > 1609254879.065 1903707 0.11829741980375917%
> > 2069557067.911 2484547 0.12005211349440528%
> > 1725151836.363 2491288 0.1444097816486799%
> > 2077004221.561 2508662 0.12078271069254648%
> > 1650748196.246 3195100 0.19355465644403197%
> > 1843881436.725 2731880 0.14815920078094685%
> > 1677080490.297 2752567 0.16412849686853959%
> > 2103777441.563 3252694 0.15461207710181613%
> > 1598451912.768 2345538 0.1467381021139566%
> > 1825835416.807 2381575 0.13043755083713246%
> > 1969692431.120 2198750 0.11162910336969484%
> > 1867823718.088 1813348 0.097083465770326324%
> > 1929972836.031 2140104 0.11088777831718792%
> > 1627140530.277 2519227 0.15482541016731566%
> > 1779407314.725 2828755 0.15897175293095681%
> > 1724839045.836 1937500 0.11232932166496305%
> > 1780135194.997 2889504 0.16231935687361485%
> > 1758590417.053 2193512 0.12473126082853507%
> > 1729351400.188 1523643 0.088104881392778986%
> > 1484255938.207 2153531 0.14509162096406991%
> > 1504352024.282 1910774 0.12701641432044322%
> > 1916441202.247 2423510 0.12645887581411155%
> > 1651813655.113 2892863 0.1751325272705837%
> > 1772411947.023 2693122 0.15194673024651262%
> > 1738795901.180 3064160 0.17622309771495132%
> > 1717902110.376 3266140 0.19012375503078779%
> > 1858583685.387 2246750 0.12088505982619635%
> > 1641867344.242 2901893 0.17674345069209688%
> > 2037933222.333 3290649 0.16146991294606342%
> > 1713754879.753 2136278 0.1246548164640615%
> > 2265307843.106 2195967 0.096939010151886218%
> > 2066343912.900 2117667 0.10248376307446183%
> > 2033259596.459 2652081 0.13043494321230309%
> > 2029016228.457 2447926 0.12064595470789147%
> > 1649943910.454 2243384 0.13596728869302646%
> > 2064977308.315 2517573 0.12191770775700743%
> > 1962826877.902 2403546 0.12245328546596376%
> > 1991037543.226 3214700 0.16145853255942866%
> > 1469646744.644 1960481 0.133398111290676%
> > 2195532272.023 1693642 0.077140382839349028%
> > 1902754342.931 2542547 0.13362455376575119%
> > 1922726134.774 2726990 0.14182935108023245%
> > 1629134454.130 2814497 0.17276026499009955%
> > 1442857977.322 2886065 0.2000241912483062%
> > 1782430811.270 2884583 0.16183421997427802%
> > 1574026949.337 2873023 0.18252692568003068%
> > 2064769936.712 2717133 0.13159495165484839%
> > 1621381781.294 2143854 0.13222388611576869%
> > 1783777994.311 2515565 0.14102455619605617%
> > 1561198792.593 2402227 0.15387066729728466%
> > 1658682002.165 2924639 0.17632306832669586%
> > 1675144757.604 2216513 0.13231769910860314%
> > 1712859849.041 2274046 0.13276310967725691%
> > 1448178253.374 2507727 0.17316424923226392%
> > 1621787692.726 2822681 0.17404750403892047%
> > 1924315582.555 2275728 0.11826168330344308%
> > 2044659318.750 2606738 0.12749008972280165%
> > 1559320878.499 2049522 0.13143683434629996%
> > 1641314354.590 2786145 0.16975084585158451%
> > 1769603624.855 2008994 0.11352790940200608%
> > 1707747444.550 2813289 0.16473682973311757%
> > 1366611645.499 2930513 0.21443641356721807%
> > 1435165128.768 3161655 0.22029903992400404%
> > 1440555681.860 2831258 0.19653929630435177%
> > 1627011278.312 1956565 0.12025515901954331%
> > 1668464384.654 3863649 0.23156916237089648%
> > 1589045506.115 2743558 0.17265446391825656%
> > 1766150748.153 3207045 0.18158387687765917%
> > 1668645748.936 2997211 0.17961937109246526%
> > 1808149919.182 2420007 0.13383884678626659%
> > 1831096162.185 2226869 0.12161398434382248%
> > 1473136553.216 2774496 0.18833936296964501%
> > 1873520917.567 2427200 0.12955286366121929%
> > 1971938578.205 2207736 0.11195764535473715%
> > 1670588149.850 2437810 0.14592525394238479%
> > 1912140400.200 2407058 0.12588291109524352%
> > 1925513187.452 2415130 0.12542786077699639%
> > 1658671061.956 3111210 0.18757245311382492%
> > 1917423750.996 2888924 0.15066695603928745%
> > 1553962205.772 3386946 0.21795549386076499%
> > 1834101490.364 2367281 0.12907033838842713%
> > 1626846627.678 3362957 0.20671629044711806%
> > 1519340210.104 2320735 0.15274623712098975%
> > 1725615314.262 3813557 0.22099693764197714%
> > 1725823077.103 3137784 0.1818137699993643%
> > 1633124107.941 2007344 0.12291435722731486%
> >
> > Thanks,
> > Fengguang

2009-09-27 16:26:29

by Hugh Dickins

Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Sat, 26 Sep 2009, Andi Kleen wrote:
> > This is a bit tricky to do right now; you have a chicken and egg
> > problem between locking the page and pinning the inode mapping.
>
> One possibly simple solution would be to just allocate the page
> locked (GFP_LOCKED). When the allocator clears the flags it already
> modifies the state, so it could as well set the lock bit too. No
> atomics needed. And then clearing it later is also atomic free.

That's a good idea.

I don't particularly like adding a GFP_LOCKED just for this, and I
don't particularly like having to remember to unlock the thing on the
various(?) error paths between getting the page and adding it to cache.

But it is a good idea, and if doing it that way would really close a
race window which checking page->mapping (or whatever) cannot (I'm
simply not sure about that), then it would seem the best way to go.

Hugh

2009-09-27 19:20:26

by Nick Piggin

Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Sun, Sep 27, 2009 at 06:47:39PM +0800, Wu Fengguang wrote:
> >
> > And standard deviation is 0.04%, much larger than the difference 0.008% ..
>
> Sorry, that's not correct. I improved the accounting by treating
> function0+function1 from the two CPUs as a single entity:
>
> total time add_to_page_cache_lru percent stddev
> before 3880166848.722 9683329.610 0.250% 0.014%
> after 3828516894.376 9778088.870 0.256% 0.012%
> delta 0.006%

I don't understand why you're using this NFS workload to measure.
I see significant nfs, networking protocol and device overheads in
your profiles, also you're hitting some locks or something which
is causing massive context switching. So I don't think this is a
good test. But anyway as Hugh points out, you need to compare with
a *completely* fixed kernel, which includes auditing all users of
page flags non-atomically (slab, notably, but possibly also other
places).

One other thing to keep in mind that I will mention is that I am
going to push in a patch to the page allocator to allow callers
to avoid the refcounting (atomic_dec_and_test) in page lifetime,
which is especially important for SLUB and takes more cycles off
the page allocator...

I don't know exactly what you're going to do after that to get a
stable reference to slab pages. I guess you can read the page
flags and speculatively take some slab locks and recheck etc...

2009-09-27 19:22:49

by Nick Piggin

Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Sun, Sep 27, 2009 at 05:26:25PM +0100, Hugh Dickins wrote:
> On Sat, 26 Sep 2009, Andi Kleen wrote:
> > > This is a bit tricky to do right now; you have a chicken and egg
> > > problem between locking the page and pinning the inode mapping.
> >
> > One possibly simple solution would be to just allocate the page
> > locked (GFP_LOCKED). When the allocator clears the flags it already
> > modifies the state, so it could as well set the lock bit too. No
> > atomics needed. And then clearing it later is also atomic free.
>
> That's a good idea.
>
> I don't particularly like adding a GFP_LOCKED just for this, and I
> don't particularly like having to remember to unlock the thing on the
> various(?) error paths between getting the page and adding it to cache.

God no, please no more crazy branches in the page allocator.

I'm going to resubmit my patches to allow 0-ref page allocations,
so the pagecache will be able to work with those to do what we
want here.


> But it is a good idea, and if doing it that way would really close a
> race window which checking page->mapping (or whatever) cannot (I'm
> simply not sure about that), then it would seem the best way to go.

Yep, seems reasonable: the ordering is no technical burden, and a
simple comment pointing to hwpoison will keep it maintainable.

2009-09-27 21:57:31

by Hugh Dickins

[permalink] [raw]
Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Sun, 27 Sep 2009, Nick Piggin wrote:
> On Sun, Sep 27, 2009 at 05:26:25PM +0100, Hugh Dickins wrote:
> >
> > I don't particularly like adding a GFP_LOCKED just for this, and I
> > don't particularly like having to remember to unlock the thing on the
> > various(?) error paths between getting the page and adding it to cache.
>
> God no, please no more crazy branches in the page allocator.
>
> I'm going to resubmit my patches to allow 0-ref page allocations,
> so the pagecache will be able to work with those to do what we
> want here.
>
> > But it is a good idea, and if doing it that way would really close a
> > race window which checking page->mapping (or whatever) cannot (I'm
> > simply not sure about that), then it would seem the best way to go.
>
> Yep, seems reasonable: the ordering is no technical burden, and a
> simple comment pointing to hwpoison will keep it maintainable.

You move from "God no" to "Yep, seems reasonable"!

I think perhaps you couldn't bring yourself to believe that I was
giving any support to Andi's GFP_LOCKED idea. Pretend I did not!

I'll assume we stick with the "God no", and we'll see how what
you come up with affects what they want.

Hugh

2009-09-27 23:01:20

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Sun, Sep 27, 2009 at 10:57:29PM +0100, Hugh Dickins wrote:
> On Sun, 27 Sep 2009, Nick Piggin wrote:
> > On Sun, Sep 27, 2009 at 05:26:25PM +0100, Hugh Dickins wrote:
> > >
> > > I don't particularly like adding a GFP_LOCKED just for this, and I
> > > don't particularly like having to remember to unlock the thing on the
> > > various(?) error paths between getting the page and adding it to cache.
> >
> > God no, please no more crazy branches in the page allocator.
> >
> > I'm going to resubmit my patches to allow 0-ref page allocations,
> > so the pagecache will be able to work with those to do what we
> > want here.
> >
> > > But it is a good idea, and if doing it that way would really close a
> > > race window which checking page->mapping (or whatever) cannot (I'm
> > > simply not sure about that), then it would seem the best way to go.
> >
> > Yep, seems reasonable: the ordering is no technical burden, and a
> > simple comment pointing to hwpoison will keep it maintainable.
>
> You move from "God no" to "Yep, seems reasonable"!
>
> I think perhaps you couldn't bring yourself to believe that I was
> giving any support to Andi's GFP_LOCKED idea. Pretend I did not!
>
> I'll assume we stick with the "God no", and we'll see how what
> you come up with affects what they want.

Well, yes, I mean "no" to a GFP_LOCKED... If you follow me :)

Reasonable being the basic idea of setting up our flags before we
increment page count, although of course we'd want to see how all
the error cases etc pan out.

There is no real rush AFAIKS to fix this one single pagecache site
while we have problems with slab allocators and all other unaudited
places that nonatomically modify page flags with an elevated
page reference ... just mark HWPOISON as broken for the moment, or
cut it down to do something much simpler I guess?

2009-09-28 01:19:44

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

> There is no real rush AFAIKS to fix this one single pagecache site
> while we have problems with slab allocators and all other unaudited
> places that nonatomically modify page flags with an elevated

hwpoison ignores slab pages.

> page reference ... just mark HWPOISON as broken for the moment, or
> cut it down to do something much simpler I guess?

Erm no. These cases are *EXTREMELY* unlikely to hit.

I'll look into exploiting the ordering of the mapping assignment.

-Andi

--
[email protected] -- Speaking for myself only.

2009-09-28 01:52:19

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Mon, Sep 28, 2009 at 09:19:43AM +0800, Andi Kleen wrote:
> > There is no real rush AFAIKS to fix this one single pagecache site
> > while we have problems with slab allocators and all other unaudited
> > places that nonatomically modify page flags with an elevated
>
> hwpoison ignores slab pages.
>
> > page reference ... just mark HWPOISON as broken for the moment, or
> > cut it down to do something much simpler I guess?
>
> Erm no. These cases are *EXTREMELY* unlikely to hit.
>
> I'll look into exploiting the ordering of the mapping assignment.

Andi, given that the overheads of this patch are considered unacceptable,
I think we can just drop it.

The proposed schemes are already tricky enough (and may not achieve
100% correctness). We have not even considered the interaction with
free buddy pages, unpoison, and hwpoison filtering.

It may well result in something unmanageable.

On the other hand, we may just ignore the __set_page_locked race,

- it could trigger BUG() on unlock_page(), however that's _no worse_
than a plain kernel without hwpoison. A plain kernel will also die when
trying to fill data into the newly allocated pages.
- it is _not yet_ an LRU page. So it does not hurt the general idea of
"hwpoison can handle LRU pages reliably".
- in hwpoison stress testing, we can avoid such pages by checking the
PG_lru bit. Thus we can make the tests immune to this race.

Or,
- the page being __set_page_locked() is _not_ a well-behaved LRU page
- we can prevent the kernel panic in the tests
- for a production workload, this presents merely another (rare) type
of kernel page we cannot rescue.

Thanks,
Fengguang

2009-09-28 02:57:42

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Mon, Sep 28, 2009 at 03:19:43AM +0200, Andi Kleen wrote:
> > There is no real rush AFAIKS to fix this one single pagecache site
> > while we have problems with slab allocators and all other unaudited
> > places that nonatomically modify page flags with an elevated
>
> hwpoison ignores slab pages.

"ignores" them *after* it has already written to page flags?
By that time it's too late.


> > page reference ... just mark HWPOISON as broken for the moment, or
> > cut it down to do something much simpler I guess?
>
> Erm no. These cases are *EXTREMELY* unlikely to hit.

Well it's fundamentally badly buggy, rare or not. We could avoid
lots of nasty atomic operations if we just care that it works
most of the time. I guess it's a matter of perspective but I
won't push for one thing or the other in hwpoison code so long
as it stays out of core code for the most part.

2009-09-28 04:11:09

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Mon, Sep 28, 2009 at 04:57:41AM +0200, Nick Piggin wrote:
> On Mon, Sep 28, 2009 at 03:19:43AM +0200, Andi Kleen wrote:
> > > There is no real rush AFAIKS to fix this one single pagecache site
> > > while we have problems with slab allocators and all other unaudited
> > > places that nonatomically modify page flags with an elevated
> >
> > hwpoison ignores slab pages.
>
> "ignores" them *after* it has already written to page flags?
> By that time it's too late.

Yes, currently the page lock comes first. The only exception
is for page count == 0 pages. I suppose we could move the slab check
up, but then it only helps when slab is set.

So if you make slab use refcount == 0 pages that would help.

> Well it's fundamentally badly buggy, rare or not. We could avoid

Let's put it like this -- any access to the poisoned cache lines
in that page will trigger a panic anyways.

-Andi

--
[email protected] -- Speaking for myself only.

2009-09-28 04:29:59

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Mon, Sep 28, 2009 at 06:11:08AM +0200, Andi Kleen wrote:
> On Mon, Sep 28, 2009 at 04:57:41AM +0200, Nick Piggin wrote:
> > On Mon, Sep 28, 2009 at 03:19:43AM +0200, Andi Kleen wrote:
> > > > There is no real rush AFAIKS to fix this one single pagecache site
> > > > while we have problems with slab allocators and all other unaudited
> > > > places that nonatomically modify page flags with an elevated
> > >
> > > hwpoison ignores slab pages.
> >
> > "ignores" them *after* it has already written to page flags?
> > By that time it's too late.
>
> Yes, currently the page lock comes first. The only exception
> is for page count == 0 pages. I suppose we could move the slab check
> up, but then it only helps when slab is set.

Yes, so it misses other potential non-atomic page flags manipulations.


> So if you make slab use refcount == 0 pages that would help.

Yes it would help here and also help with the pagecache part too,
and most other cases I suspect. I have some patches to do this at
home so I'll post them when I get back.


> > Well it's fundamentally badly buggy, rare or not. We could avoid
>
> Let's put it like this -- any access to the poisoned cache lines
> in that page will trigger a panic anyways.

Well yes, although maybe people who care about this feature will
care more about having a reliable panic than introducing a
random data corruption. I guess the chance of an ECC failure
combined with the chance that the race window hits could be some orders
of magnitude less likely than other sources of bugs ;) but still
I don't like using that argument to allow known bugs -- it leads
to interesting things if we take it to a conclusion.

2009-09-28 08:44:10

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Mon, Sep 28, 2009 at 03:20:25AM +0800, Nick Piggin wrote:
> On Sun, Sep 27, 2009 at 06:47:39PM +0800, Wu Fengguang wrote:
> > >
> > > And standard deviation is 0.04%, much larger than the difference 0.008% ..
> >
> > Sorry that's not correct. I improved the accounting by treating
> > function0+function1 from two CPUs as an integral entity:
> >
> > total time add_to_page_cache_lru percent stddev
> > before 3880166848.722 9683329.610 0.250% 0.014%
> > after 3828516894.376 9778088.870 0.256% 0.012%
> > delta 0.006%
>
> I don't understand why you're doing this NFS workload to measure?

Because it is the first convenient workload that hit my mind, and it avoids
real disk IO :)

> I see significant nfs, networking protocol and device overheads in
> your profiles, also you're hitting some locks or something which
> is causing massive context switching. So I don't think this is a
> good test.

Yes there are overheads. However it is a real and common workload.

> But anyway as Hugh points out, you need to compare with a
> *completely* fixed kernel, which includes auditing all users of page
> flags non-atomically (slab, notably, but possibly also other
> places).

That's a good point. We can do more benchmarks when more fixes are
available. However I suspect their design goal will be "fix them
without introducing noticeable overheads" :)

> One other thing to keep in mind that I will mention is that I am
> going to push in a patch to the page allocator to allow callers
> to avoid the refcounting (atomic_dec_and_test) in page lifetime,
> which is especially important for SLUB and takes more cycles off
> the page allocator...
>
> I don't know exactly what you're going to do after that to get a
> stable reference to slab pages. I guess you can read the page
> flags and speculatively take some slab locks and recheck etc...

For reliability we could skip the page lock on zero-refcount pages.

We may lose the PG_hwpoison bit on races with __SetPageSlub*, however
it should be an acceptable imperfection.

Thanks,
Fengguang

2009-09-29 05:16:32

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Mon, Sep 28, 2009 at 04:44:01PM +0800, Wu Fengguang wrote:
> On Mon, Sep 28, 2009 at 03:20:25AM +0800, Nick Piggin wrote:

> > One other thing to keep in mind that I will mention is that I am
> > going to push in a patch to the page allocator to allow callers
> > to avoid the refcounting (atomic_dec_and_test) in page lifetime,
> > which is especially important for SLUB and takes more cycles off
> > the page allocator...
> >
> > I don't know exactly what you're going to do after that to get a
> > stable reference to slab pages. I guess you can read the page
> > flags and speculatively take some slab locks and recheck etc...
>
> For reliability we could skip the page lock on zero-refcount pages.
>
> We may lose the PG_hwpoison bit on races with __SetPageSlub*, however
> it should be an acceptable imperfection.

I'd like to propose this fix for 2.6.32, which achieves 100% correctness
for the discussed races :)

In brief it is

if (is not lru page)
return and don't touch page lock;

Any comments?

Thanks,
Fengguang

---
HWPOISON: return early on non-LRU pages

This avoids unnecessary races with __set_page_locked() and
__SetPageSlab*() and maybe more non-atomic page flag operations.

Signed-off-by: Wu Fengguang <[email protected]>
---
mm/memory-failure.c | 54 ++++++++++++++----------------------------
1 file changed, 19 insertions(+), 35 deletions(-)

--- sound-2.6.orig/mm/memory-failure.c 2009-09-29 12:27:36.000000000 +0800
+++ sound-2.6/mm/memory-failure.c 2009-09-29 12:32:52.000000000 +0800
@@ -327,16 +327,6 @@ static const char *action_name[] = {
};

/*
- * Error hit kernel page.
- * Do nothing, try to be lucky and not touch this instead. For a few cases we
- * could be more sophisticated.
- */
-static int me_kernel(struct page *p, unsigned long pfn)
-{
- return DELAYED;
-}
-
-/*
* Already poisoned page.
*/
static int me_ignore(struct page *p, unsigned long pfn)
@@ -370,9 +360,6 @@ static int me_pagecache_clean(struct pag
int ret = FAILED;
struct address_space *mapping;

- if (!isolate_lru_page(p))
- page_cache_release(p);
-
/*
* For anonymous pages we're done the only reference left
* should be the one m_f() holds.
@@ -498,30 +485,18 @@ static int me_pagecache_dirty(struct pag
*/
static int me_swapcache_dirty(struct page *p, unsigned long pfn)
{
- int ret = FAILED;
-
ClearPageDirty(p);
/* Trigger EIO in shmem: */
ClearPageUptodate(p);

- if (!isolate_lru_page(p)) {
- page_cache_release(p);
- ret = DELAYED;
- }
-
- return ret;
+ return DELAYED;
}

static int me_swapcache_clean(struct page *p, unsigned long pfn)
{
- int ret = FAILED;
-
- if (!isolate_lru_page(p)) {
- page_cache_release(p);
- ret = RECOVERED;
- }
delete_from_swap_cache(p);
- return ret;
+
+ return RECOVERED;
}

/*
@@ -576,13 +551,6 @@ static struct page_state {
{ reserved, reserved, "reserved kernel", me_ignore },
{ buddy, buddy, "free kernel", me_free },

- /*
- * Could in theory check if slab page is free or if we can drop
- * currently unused objects without touching them. But just
- * treat it as standard kernel for now.
- */
- { slab, slab, "kernel slab", me_kernel },
-
#ifdef CONFIG_PAGEFLAGS_EXTENDED
{ head, head, "huge", me_huge_page },
{ tail, tail, "huge", me_huge_page },
@@ -775,6 +743,22 @@ int __memory_failure(unsigned long pfn,
}

/*
+ * We ignore non-LRU pages for good reasons.
+ * - PG_locked is only well defined for LRU pages and a few others
+ * - to avoid races with __set_page_locked()
+ * - to avoid races with __SetPageSlab*() (and more non-atomic ops)
+ * The check (unnecessarily) ignores LRU pages being isolated and
+ * walked by the page reclaim code, however that's not a big loss.
+ */
+ if (!PageLRU(p))
+ lru_add_drain_all();
+ if (isolate_lru_page(p)) {
+ action_result(pfn, "non LRU", IGNORED);
+ return -EBUSY;
+ }
+ page_cache_release(p);
+
+ /*
* Lock the page and wait for writeback to finish.
* It's very difficult to mess with pages currently under IO
* and in many cases impossible, so we just avoid it here.

2009-10-01 02:02:07

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Mon, Sep 28, 2009 at 04:44:01PM +0800, Wu Fengguang wrote:
> On Mon, Sep 28, 2009 at 03:20:25AM +0800, Nick Piggin wrote:
> > On Sun, Sep 27, 2009 at 06:47:39PM +0800, Wu Fengguang wrote:
> > > >
> > > > And standard deviation is 0.04%, much larger than the difference 0.008% ..
> > >
> > > Sorry that's not correct. I improved the accounting by treating
> > > function0+function1 from two CPUs as an integral entity:
> > >
> > > total time add_to_page_cache_lru percent stddev
> > > before 3880166848.722 9683329.610 0.250% 0.014%
> > > after 3828516894.376 9778088.870 0.256% 0.012%
> > > delta 0.006%
> >
> > I don't understand why you're doing this NFS workload to measure?
>
> Because it is the first convenient workload that hit my mind, and it avoids
> real disk IO :)

Using tmpfs or sparse files is probably a lot easier.


> > I see significant nfs, networking protocol and device overheads in
> > your profiles, also you're hitting some locks or something which
> > is causing massive context switching. So I don't think this is a
> > good test.
>
> Yes there are overheads. However it is a real and common workload.

Right, but so are lots of other workloads that don't hit
add_to_page_cache heavily :)


> > But anyway as Hugh points out, you need to compare with a
> > *completely* fixed kernel, which includes auditing all users of page
> > flags non-atomically (slab, notably, but possibly also other
> > places).
>
> That's a good point. We can do more benchmarks when more fixes are
> available. However I suspect their design goal will be "fix them
> without introducing noticeable overheads" :)

s/noticeable//

The problem with all the non-noticeable overheads that we're
continually adding to the kernel is that we're adding them to
the kernel. Non-noticeable part only makes it worse because
you can't bisect them :)


> > One other thing to keep in mind that I will mention is that I am
> > going to push in a patch to the page allocator to allow callers
> > to avoid the refcounting (atomic_dec_and_test) in page lifetime,
> > which is especially important for SLUB and takes more cycles off
> > the page allocator...
> >
> > I don't know exactly what you're going to do after that to get a
> > stable reference to slab pages. I guess you can read the page
> > flags and speculatively take some slab locks and recheck etc...
>
> For reliability we could skip the page lock on zero-refcount pages.
>
> We may lose the PG_hwpoison bit on races with __SetPageSlub*, however
> it should be an acceptable imperfection.

I think if you're willing to accept these problems, then it is
completely reasonable to also accept similar races with kernel
fastpaths to avoid extra overheads there.

2009-10-02 10:55:14

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH] HWPOISON: remove the unsafe __set_page_locked()

On Thu, Oct 01, 2009 at 10:02:07AM +0800, Nick Piggin wrote:
> On Mon, Sep 28, 2009 at 04:44:01PM +0800, Wu Fengguang wrote:
> > On Mon, Sep 28, 2009 at 03:20:25AM +0800, Nick Piggin wrote:
> > > On Sun, Sep 27, 2009 at 06:47:39PM +0800, Wu Fengguang wrote:
> > > > >
> > > > > And standard deviation is 0.04%, much larger than the difference 0.008% ..
> > > >
> > > > Sorry that's not correct. I improved the accounting by treating
> > > > function0+function1 from two CPUs as an integral entity:
> > > >
> > > > total time add_to_page_cache_lru percent stddev
> > > > before 3880166848.722 9683329.610 0.250% 0.014%
> > > > after 3828516894.376 9778088.870 0.256% 0.012%
> > > > delta 0.006%
> > >
> > > I don't understand why you're doing this NFS workload to measure?
> >
> > Because it is the first convenient workload that hit my mind, and it avoids
> > real disk IO :)
>
> Using tmpfs or sparse files is probably a lot easier.

Good ideas. In fact I tried them in the very beginning.
The ratios are roughly at the same level (which is somewhat unexpected):

total time add_to_page_cache_lru percent stddev
tmpfs 1579056274.576 2615476.234 0.1656354036338758%
sparse 1074931917.425 3001273 0.27920586888791538%

Workload is to copy 1G /dev/zero to /dev/shm/, or 1G sparse file
(ext2) to /dev/null.

echo 1 > /debug/tracing/function_profile_enabled
cp /dev/zero /dev/shm/
echo 0 > /debug/tracing/function_profile_enabled

dd if=/dev/zero of=/mnt/test bs=1k count=1 seek=1048575
echo 1 > /debug/tracing/function_profile_enabled
cp /mnt/test/sparse /dev/null
echo 0 > /debug/tracing/function_profile_enabled

> > > I see significant nfs, networking protocol and device overheads in
> > > your profiles, also you're hitting some locks or something which
> > > is causing massive context switching. So I don't think this is a
> > > good test.
> >
> > Yes there are overheads. However it is a real and common workload.
>
> Right, but so are lots of other workloads that don't hit
> add_to_page_cache heavily :)
>
>
> > > But anyway as Hugh points out, you need to compare with a
> > > *completely* fixed kernel, which includes auditing all users of page
> > > flags non-atomically (slab, notably, but possibly also other
> > > places).
> >
> > That's a good point. We can do more benchmarks when more fixes are
> > available. However I suspect their design goal will be "fix them
> > without introducing noticeable overheads" :)
>
> s/noticeable//
>
> The problem with all the non-noticeable overheads that we're
> continually adding to the kernel is that we're adding them to
> the kernel. Non-noticeable part only makes it worse because
> you can't bisect them :)

Yes it makes sense.

> > > One other thing to keep in mind that I will mention is that I am
> > > going to push in a patch to the page allocator to allow callers
> > > to avoid the refcounting (atomic_dec_and_test) in page lifetime,
> > > which is especially important for SLUB and takes more cycles off
> > > the page allocator...
> > >
> > > I don't know exactly what you're going to do after that to get a
> > > stable reference to slab pages. I guess you can read the page
> > > flags and speculatively take some slab locks and recheck etc...
> >
> > For reliability we could skip the page lock on zero-refcount pages.
> >
> > We may lose the PG_hwpoison bit on races with __SetPageSlub*, however
> > it should be an acceptable imperfection.
>
> I think if you're willing to accept these problems, then it is
> completely reasonable to also accept similar races with kernel
> fastpaths to avoid extra overheads there.

Yes I do. Even better, for this particular race, we managed to avoid
it completely without introducing overheads in the fast path :)

Thanks,
Fengguang