From bd4149290c3edc09454a8a7e7ef3a5544cb9eed6 Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Mon, 24 Oct 2022 17:46:18 +0000
Subject: Docs/admin-guide/mm/damon/usage: describe the rules of sysfs region directories

Patch series "Docs/admin-buide/mm/damon/usage: minor fixes".

DAMON usage document contains an unclear description and a wrong usage
example. This patchset fixes the two minor problems.

This patch (of 2):

Target region directories of DAMON sysfs interface should contain no
overlap and sorted by the address, but not clearly documented. Actually,
a user had an issue[1] due to the poor documentation. Add clear
description of it on the usage document.

[1] https://lore.kernel.org/damon/CAEZ6=UNUcH2BvJj++OrT=XQLdkidU79wmCO=tantSOB36pPNTg@mail.gmail.com/

Link: https://lkml.kernel.org/r/20221024174619.15600-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20221024174619.15600-2-sj@kernel.org
Signed-off-by: SeongJae Park
Reported-by: Vinicius Petrucci
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/mm/damon/usage.rst | 3 +++
 1 file changed, 3 insertions(+)

(limited to 'Documentation/admin-guide')

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index b47b0cbbd491..89d9a4f75a29 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -235,6 +235,9 @@ In each region directory, you will find two files (``start`` and ``end``). You
 can set and get the start and end addresses of the initial monitoring target
 region by writing to and reading from the files, respectively.
 
+Each region should not overlap with others. ``end`` of directory ``N`` should
+be equal or smaller than ``start`` of directory ``N+1``.
+
 contexts//schemes/
 ---------------------
 
-- 
cgit v1.2.3

From 1b0166387586cae69d7da783f0a4521864534aad Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Mon, 24 Oct 2022 17:46:19 +0000
Subject: Docs/admin-guide/mm/damon/usage: fix wrong usage example of init_regions file

DAMON debugfs interface assumes the users will write all inputs at once.
However, redirecting a string of multiple lines sometimes end up writing
line by line. Therefore, the example usage of 'init_regions' file, which
writes input as a string of multiple lines can fail. Fix it to use a
single line string instead. Also update the description of the usage to
not assume users will write inputs in multiple lines.

Link: https://lkml.kernel.org/r/20221024174619.15600-3-sj@kernel.org
Signed-off-by: SeongJae Park
Cc: Jonathan Corbet
Cc: Vinicius Petrucci
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/mm/damon/usage.rst | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

(limited to 'Documentation/admin-guide')

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index 89d9a4f75a29..c17e02e1e426 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -468,8 +468,9 @@ regions in case of physical memory monitoring. Therefore, users should set the
 monitoring target regions by themselves.
 
 In such cases, users can explicitly set the initial monitoring target regions
-as they want, by writing proper values to the ``init_regions`` file. Each line
-of the input should represent one region in below form.::
+as they want, by writing proper values to the ``init_regions`` file. The input
+should be a sequence of three integers separated by white spaces that represent
+one region in below form.::
@@ -484,9 +485,9 @@ ranges, ``20-40`` and ``50-100`` as that of pid 4242, which is the second one
     # cd /damon
     # cat target_ids
     42 4242
-    # echo "0 1 100
-            0 100 200
-            1 20 40
+    # echo "0 1 100 \
+            0 100 200 \
+            1 20 40 \
             1 50 100" > init_regions
 
 Note that this sets the initial monitoring target regions only. In case of
-- 
cgit v1.2.3

From 57e9cc50f4dd926d6c38751799d25cad89fb2bd9 Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Wed, 26 Oct 2022 14:01:33 -0400
Subject: mm: vmscan: split khugepaged stats from direct reclaim stats

Direct reclaim stats are useful for identifying a potential source for
application latency, as well as spotting issues with kswapd. However,
khugepaged currently distorts the picture: as a kernel thread it doesn't
impose allocation latencies on userspace, and it explicitly opts out of
kswapd reclaim. Its activity showing up in the direct reclaim stats is
misleading. Counting it as kswapd reclaim could also cause confusion when
trying to understand actual kswapd behavior.

Break out khugepaged from the direct reclaim counters into new
pgsteal_khugepaged, pgdemote_khugepaged, pgscan_khugepaged counters.

Test with a huge executable (CONFIG_READ_ONLY_THP_FOR_FS):

pgsteal_kswapd 1342185
pgsteal_direct 0
pgsteal_khugepaged 3623
pgscan_kswapd 1345025
pgscan_direct 0
pgscan_khugepaged 3623

Link: https://lkml.kernel.org/r/20221026180133.377671-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reported-by: Eric Bergen
Cc: Matthew Wilcox (Oracle)
Cc: Yang Shi
Cc: Yosry Ahmed
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/cgroup-v2.rst |  6 ++++++
 include/linux/khugepaged.h              |  6 ++++++
 include/linux/vm_event_item.h           |  3 +++
 mm/khugepaged.c                         |  5 +++++
 mm/memcontrol.c                         |  8 ++++++--
 mm/vmscan.c                             | 32 ++++++++++++++++++++++++--------
 mm/vmstat.c                             |  3 +++
 7 files changed, 53 insertions(+), 10 deletions(-)

(limited to 'Documentation/admin-guide')

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index dc254a3cb956..74cec76be9f2 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1488,12 +1488,18 @@ PAGE_SIZE multiple when read back.
 	  pgscan_direct (npn)
 		Amount of scanned pages directly (in an inactive LRU list)
 
+	  pgscan_khugepaged (npn)
+		Amount of scanned pages by khugepaged (in an inactive LRU list)
+
 	  pgsteal_kswapd (npn)
 		Amount of reclaimed pages by kswapd
 
 	  pgsteal_direct (npn)
 		Amount of reclaimed pages directly
 
+	  pgsteal_khugepaged (npn)
+		Amount of reclaimed pages by khugepaged
+
 	  pgfault (npn)
 		Total number of page faults incurred
 
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 70162d707caf..f68865e19b0b 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -15,6 +15,7 @@ extern void __khugepaged_exit(struct mm_struct *mm);
 extern void khugepaged_enter_vma(struct vm_area_struct *vma,
 				 unsigned long vm_flags);
 extern void khugepaged_min_free_kbytes_update(void);
+extern bool current_is_khugepaged(void);
 #ifdef CONFIG_SHMEM
 extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 				   bool install_pmd);
@@ -57,6 +58,11 @@ static inline int collapse_pte_mapped_thp(struct mm_struct *mm,
 static inline void khugepaged_min_free_kbytes_update(void)
 {
 }
+
+static inline bool current_is_khugepaged(void)
+{
+	return false;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_KHUGEPAGED_H */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3518dba1e02f..7f5d1caf5890 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -40,10 +40,13 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGREUSE,
 		PGSTEAL_KSWAPD,
 		PGSTEAL_DIRECT,
+		PGSTEAL_KHUGEPAGED,
 		PGDEMOTE_KSWAPD,
 		PGDEMOTE_DIRECT,
+		PGDEMOTE_KHUGEPAGED,
 		PGSCAN_KSWAPD,
 		PGSCAN_DIRECT,
+		PGSCAN_KHUGEPAGED,
 		PGSCAN_DIRECT_THROTTLE,
 		PGSCAN_ANON,
 		PGSCAN_FILE,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3703a56571c1..9c111273bbf9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2577,6 +2577,11 @@ void khugepaged_min_free_kbytes_update(void)
 	mutex_unlock(&khugepaged_mutex);
 }
 
+bool current_is_khugepaged(void)
+{
+	return kthread_func(current) == khugepaged;
+}
+
 static int madvise_collapse_errno(enum scan_result r)
 {
 	/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c95e2ed6e7fd..23750cec0036 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -661,8 +661,10 @@ static const unsigned int memcg_vm_event_stat[] = {
 	PGPGOUT,
 	PGSCAN_KSWAPD,
 	PGSCAN_DIRECT,
+	PGSCAN_KHUGEPAGED,
 	PGSTEAL_KSWAPD,
 	PGSTEAL_DIRECT,
+	PGSTEAL_KHUGEPAGED,
 	PGFAULT,
 	PGMAJFAULT,
 	PGREFILL,
@@ -1574,10 +1576,12 @@ static void memory_stat_format(struct mem_cgroup *memcg, char *buf, int bufsize)
 	/* Accumulated memory events */
 	seq_buf_printf(&s, "pgscan %lu\n",
 		       memcg_events(memcg, PGSCAN_KSWAPD) +
-		       memcg_events(memcg, PGSCAN_DIRECT));
+		       memcg_events(memcg, PGSCAN_DIRECT) +
+		       memcg_events(memcg, PGSCAN_KHUGEPAGED));
 	seq_buf_printf(&s, "pgsteal %lu\n",
 		       memcg_events(memcg, PGSTEAL_KSWAPD) +
-		       memcg_events(memcg, PGSTEAL_DIRECT));
+		       memcg_events(memcg, PGSTEAL_DIRECT) +
+		       memcg_events(memcg, PGSTEAL_KHUGEPAGED));
 
 	for (i = 0; i < ARRAY_SIZE(memcg_vm_event_stat); i++) {
 		if (memcg_vm_event_stat[i] == PGPGIN ||
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 55a5b5d66d68..d7c71be6417d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -54,6 +54,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -1047,6 +1048,24 @@ void drop_slab(void)
 		drop_slab_node(nid);
 }
 
+static int reclaimer_offset(void)
+{
+	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
+			PGDEMOTE_DIRECT - PGDEMOTE_KSWAPD);
+	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
+			PGSCAN_DIRECT - PGSCAN_KSWAPD);
+	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
+			PGDEMOTE_KHUGEPAGED - PGDEMOTE_KSWAPD);
+	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
+			PGSCAN_KHUGEPAGED - PGSCAN_KSWAPD);
+
+	if (current_is_kswapd())
+		return 0;
+	if (current_is_khugepaged())
+		return PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD;
+	return PGSTEAL_DIRECT - PGSTEAL_KSWAPD;
+}
+
 static inline int is_page_cache_freeable(struct folio *folio)
 {
 	/*
@@ -1599,10 +1618,7 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
 		      &nr_succeeded);
 
-	if (current_is_kswapd())
-		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
-	else
-		__count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
+	__count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
 
 	return nr_succeeded;
 }
@@ -2475,7 +2491,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 				     &nr_scanned, sc, lru);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
-	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
+	item = PGSCAN_KSWAPD + reclaimer_offset();
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
@@ -2492,7 +2508,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	move_folios_to_lru(lruvec, &folio_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
+	item = PGSTEAL_KSWAPD + reclaimer_offset();
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
@@ -4871,7 +4887,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 			break;
 	}
 
-	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
+	item = PGSCAN_KSWAPD + reclaimer_offset();
 	if (!cgroup_reclaim(sc)) {
 		__count_vm_events(item, isolated);
 		__count_vm_events(PGREFILL, sorted);
@@ -5049,7 +5065,7 @@ retry:
 	if (walk && walk->batched)
 		reset_batch_size(lruvec, walk);
 
-	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
+	item = PGSTEAL_KSWAPD + reclaimer_offset();
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, reclaimed);
 	__count_memcg_events(memcg, item, reclaimed);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b2371d745e00..1ea6a5ce1c41 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1271,10 +1271,13 @@ const char * const vmstat_text[] = {
 	"pgreuse",
 	"pgsteal_kswapd",
 	"pgsteal_direct",
+	"pgsteal_khugepaged",
 	"pgdemote_kswapd",
 	"pgdemote_direct",
+	"pgdemote_khugepaged",
 	"pgscan_kswapd",
 	"pgscan_direct",
+	"pgscan_khugepaged",
 	"pgscan_direct_throttle",
 	"pgscan_anon",
 	"pgscan_file",
-- 
cgit v1.2.3

From 7f0a86f3c99bc9736445ef64aa65c9bd6161a47b Mon Sep 17 00:00:00 2001
From: SeongJae Park
Date: Tue, 1 Nov 2022 22:03:27 +0000
Subject: Docs/admin-guide/mm/damon/usage: document schemes//tried_regions sysfs directory

Document 'tried_regions' directory in DAMON sysfs interface usage in the
administrator guide.

Link: https://lkml.kernel.org/r/20221101220328.95765-8-sj@kernel.org
Signed-off-by: SeongJae Park
Cc: Jonathan Corbet
Cc: Shuah Khan
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/mm/damon/usage.rst | 45 ++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 3 deletions(-)

(limited to 'Documentation/admin-guide')

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index c17e02e1e426..1a5b6b71efa1 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -88,6 +88,9 @@ comma (","). ::
     │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
     │ │ │ │ │ │ │ watermarks/metric,interval_us,high,mid,low
     │ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
+    │ │ │ │ │ │ │ tried_regions/
+    │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age
+    │ │ │ │ │ │ │ │ ...
     │ │ │ │ │ │ ...
     │ │ │ │ ...
     │ │ ...
@@ -125,7 +128,14 @@ in the state. Writing ``commit`` to the ``state`` file makes kdamond reads the
 user inputs in the sysfs files except ``state`` file again. Writing
 ``update_schemes_stats`` to ``state`` file updates the contents of stats files
 for each DAMON-based operation scheme of the kdamond. For details of the
-stats, please refer to :ref:`stats section `.
+stats, please refer to :ref:`stats section `. Writing
+``update_schemes_tried_regions`` to ``state`` file updates the DAMON-based
+operation scheme action tried regions directory for each DAMON-based operation
+scheme of the kdamond. Writing ``clear_schemes_tried_regions`` to ``state``
+file clears the DAMON-based operating scheme action tried regions directory for
+each DAMON-based operation scheme of the kdamond. For details of the
+DAMON-based operation scheme action tried regions directory, please refer to
+:ref:tried_regions section `.
 
 If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
 
@@ -166,6 +176,8 @@ You can set and get what type of monitoring operations DAMON will use for the
 context by writing one of the keywords listed in ``avail_operations`` file and
 reading from the ``operations`` file.
 
+.. _sysfs_monitoring_attrs:
+
 contexts//monitoring_attrs/
 ------------------------------
 
@@ -255,8 +267,9 @@ to ``N-1``. Each directory represents each DAMON-based operation scheme.
 schemes//
 ------------
 
-In each scheme directory, four directories (``access_pattern``, ``quotas``,
-``watermarks``, and ``stats``) and one file (``action``) exist.
+In each scheme directory, five directories (``access_pattern``, ``quotas``,
+``watermarks``, ``stats``, and ``tried_regions``) and one file (``action``)
+exist.
 
 The ``action`` file is for setting and getting what action you want to apply to
 memory regions having specific access pattern of the interest. The keywords
@@ -351,6 +364,32 @@ should ask DAMON sysfs interface to updte the content of the files for the
 stats by writing a special keyword, ``update_schemes_stats`` to the relevant
 ``kdamonds//state`` file.
 
+.. _sysfs_schemes_tried_regions:
+
+schemes//tried_regions/
+--------------------------
+
+When a special keyword, ``update_schemes_tried_regions``, is written to the
+relevant ``kdamonds//state`` file, DAMON creates directories named integer
+starting from ``0`` under this directory. Each directory contains files
+exposing detailed information about each of the memory region that the
+corresponding scheme's ``action`` has tried to be applied under this directory,
+during next :ref:`aggregation interval `. The
+information includes address range, ``nr_accesses``, , and ``age`` of the
+region.
+
+The directories will be removed when another special keyword,
+``clear_schemes_tried_regions``, is written to the relevant
+``kdamonds//state`` file.
+
+tried_regions//
+------------------
+
+In each region directory, you will find four files (``start``, ``end``,
+``nr_accesses``, and ``age``). Reading the files will show the start and end
+addresses, ``nr_accesses``, and ``age`` of the region that corresponding
+DAMON-based operation scheme ``action`` has tried to be applied.
+
 Example
 ~~~~~~~
 
-- 
cgit v1.2.3

From 60e9b39ebec56467c36c3da76eee28083196cdf1 Mon Sep 17 00:00:00 2001
From: Sergey Senozhatsky
Date: Wed, 9 Nov 2022 20:50:39 +0900
Subject: zram: add recompress flag to read_block_state()

Add a new flag to zram block state that shows if the page was recompressed
(using alternative compression algorithm).

Link: https://lkml.kernel.org/r/20221109115047.2921851-6-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Cc: Alexey Romanov
Cc: Nhat Pham
Cc: Nitin Gupta
Cc: Suleiman Souhlal
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/blockdev/zram.rst | 9 ++++++---
 drivers/block/zram/zram_drv.c               | 5 +++--
 2 files changed, 9 insertions(+), 5 deletions(-)

(limited to 'Documentation/admin-guide')

diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index c73b16930449..177a142c3146 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -411,9 +411,10 @@ pages of the process with*pagemap.
 If you enable the feature, you could see block state via
 /sys/kernel/debug/zram/zram0/block_state". The output is as follows::
 
-	  300    75.033841 .wh.
-	  301    63.806904 s...
-	  302    63.806919 ..hi
+	  300    75.033841 .wh..
+	  301    63.806904 s....
+	  302    63.806919 ..hi.
+	  303    62.801919 ....r
 
 First column
 	zram's block index.
@@ -430,6 +431,8 @@ Third column
 		huge page
 	i:
 		idle page
+	r:
+		recompressed page (secondary compression algorithm)
 
 First line of above example says 300th block is accessed at 75.033841sec
 and the block's state is huge so it is written back to the backing
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 97300b3a83c3..ddbfa70ef9a3 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -936,13 +936,14 @@ static ssize_t read_block_state(struct file *file, char __user *buf,
 
 		ts = ktime_to_timespec64(zram->table[index].ac_time);
 		copied = snprintf(kbuf + written, count,
-			"%12zd %12lld.%06lu %c%c%c%c\n",
+			"%12zd %12lld.%06lu %c%c%c%c%c\n",
 			index, (s64)ts.tv_sec,
 			ts.tv_nsec / NSEC_PER_USEC,
 			zram_test_flag(zram, index, ZRAM_SAME) ? 's' : '.',
 			zram_test_flag(zram, index, ZRAM_WB) ? 'w' : '.',
 			zram_test_flag(zram, index, ZRAM_HUGE) ? 'h' : '.',
-			zram_test_flag(zram, index, ZRAM_IDLE) ? 'i' : '.');
+			zram_test_flag(zram, index, ZRAM_IDLE) ? 'i' : '.',
+			zram_get_priority(zram, index) ? 'r' : '.');
 
 		if (count <= copied) {
 			zram_slot_unlock(zram, index);
-- 
cgit v1.2.3

From 443dd798062c1549e790539e572cbda4b7a8df30 Mon Sep 17 00:00:00 2001
From: Sergey Senozhatsky
Date: Wed, 9 Nov 2022 20:50:45 +0900
Subject: documentation: add zram recompression documentation

Document user-space visible device attributes that are enabled by
ZRAM_MULTI_COMP.

Link: https://lkml.kernel.org/r/20221109115047.2921851-12-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Cc: Alexey Romanov
Cc: Nhat Pham
Cc: Nitin Gupta
Cc: Suleiman Souhlal
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/blockdev/zram.rst | 81 +++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

(limited to 'Documentation/admin-guide')

diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index 177a142c3146..d898b7ace33d 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -401,6 +401,87 @@ budget in next setting is user's job.
 If admin wants to measure writeback count in a certain period, they could
 know it via /sys/block/zram0/bd_stat's 3rd column.
 
+recompression
+-------------
+
+With CONFIG_ZRAM_MULTI_COMP, zram can recompress pages using alternative
+(secondary) compression algorithms. The basic idea is that alternative
+compression algorithm can provide better compression ratio at a price of
+(potentially) slower compression/decompression speeds. Alternative compression
+algorithm can, for example, be more successful compressing huge pages (those
+that default algorithm failed to compress). Another application is idle pages
+recompression - pages that are cold and sit in the memory can be recompressed
+using more effective algorithm and, hence, reduce zsmalloc memory usage.
+
+With CONFIG_ZRAM_MULTI_COMP, zram supports up to 4 compression algorithms:
+one primary and up to 3 secondary ones. Primary zram compressor is explained
+in "3) Select compression algorithm", secondary algorithms are configured
+using recomp_algorithm device attribute.
+
+Example:::
+
+	#show supported recompression algorithms
+	cat /sys/block/zramX/recomp_algorithm
+	#1: lzo lzo-rle lz4 lz4hc [zstd]
+	#2: lzo lzo-rle lz4 [lz4hc] zstd
+
+Alternative compression algorithms are sorted by priority. In the example
+above, zstd is used as the first alternative algorithm, which has priority
+of 1, while lz4hc is configured as a compression algorithm with priority 2.
+Alternative compression algorithm's priority is provided during algorithms
+configuration:::
+
+	#select zstd recompression algorithm, priority 1
+	echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm
+
+	#select deflate recompression algorithm, priority 2
+	echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm
+
+Another device attribute that CONFIG_ZRAM_MULTI_COMP enables is recompress,
+which controls recompression.
+
+Examples:::
+
+	#IDLE pages recompression is activated by `idle` mode
+	echo "type=idle" > /sys/block/zramX/recompress
+
+	#HUGE pages recompression is activated by `huge` mode
+	echo "type=huge" > /sys/block/zram0/recompress
+
+	#HUGE_IDLE pages recompression is activated by `huge_idle` mode
+	echo "type=huge_idle" > /sys/block/zramX/recompress
+
+The number of idle pages can be significant, so user-space can pass a size
+threshold (in bytes) to the recompress knob: zram will recompress only pages
+of equal or greater size:::
+
+	#recompress all pages larger than 3000 bytes
+	echo "threshold=3000" > /sys/block/zramX/recompress
+
+	#recompress idle pages larger than 2000 bytes
+	echo "type=idle threshold=2000" > /sys/block/zramX/recompress
+
+Recompression of idle pages requires memory tracking.
+
+During re-compression for every page, that matches re-compression criteria,
+ZRAM iterates the list of registered alternative compression algorithms in
+order of their priorities. ZRAM stops either when re-compression was
+successful (re-compressed object is smaller in size than the original one)
+and matches re-compression criteria (e.g. size threshold) or when there are
+no secondary algorithms left to try. If none of the secondary algorithms can
+successfully re-compressed the page such a page is marked as incompressible,
+so ZRAM will not attempt to re-compress it in the future.
+
+This re-compression behaviour, when it iterates through the list of
+registered compression algorithms, increases our chances of finding the
+algorithm that successfully compresses a particular page. Sometimes, however,
+it is convenient (and sometimes even necessary) to limit recompression to
+only one particular algorithm so that it will not try any other algorithms.
+This can be achieved by providing a algo=NAME parameter:::
+
+	#use zstd algorithm only (if registered)
+	echo "type=huge algo=zstd" > /sys/block/zramX/recompress
+
 memory tracking
 ===============
 
-- 
cgit v1.2.3

From b46f9ea3cb351587b2cfc68f7211f7a7cc5b6673 Mon Sep 17 00:00:00 2001
From: Sergey Senozhatsky
Date: Wed, 9 Nov 2022 20:50:46 +0900
Subject: zram: add incompressible writeback

Add support for incompressible pages writeback:

  echo incompressible > /sys/block/zramX/writeback

Link: https://lkml.kernel.org/r/20221109115047.2921851-13-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Cc: Alexey Romanov
Cc: Nhat Pham
Cc: Nitin Gupta
Cc: Suleiman Souhlal
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/blockdev/zram.rst |  7 ++++++-
 drivers/block/zram/zram_drv.c               | 18 ++++++++++++------
 2 files changed, 18 insertions(+), 7 deletions(-)

(limited to 'Documentation/admin-guide')

diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index d898b7ace33d..f14c8c2e42f3 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -348,8 +348,13 @@ this can be accomplished with::
 
 	echo huge_idle > /sys/block/zramX/writeback
 
+If a user chooses to writeback only incompressible pages (pages that none of
+algorithms can compress) this can be accomplished with::
+
+	echo incompressible > /sys/block/zramX/writeback
+
 If an admin wants to write a specific page in zram device to the backing device,
-they could write a page index into the interface.
+they could write a page index into the interface::
 
 	echo "page_index=1251" > /sys/block/zramX/writeback
 
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 798c421fdd36..25b7ff2b56bf 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -645,10 +645,10 @@ static int read_from_bdev_async(struct zram *zram, struct bio_vec *bvec,
 
 #define PAGE_WB_SIG "page_index="
 
-#define PAGE_WRITEBACK 0
-#define HUGE_WRITEBACK (1<<0)
-#define IDLE_WRITEBACK (1<<1)
-
+#define PAGE_WRITEBACK			0
+#define HUGE_WRITEBACK			(1<<0)
+#define IDLE_WRITEBACK			(1<<1)
+#define INCOMPRESSIBLE_WRITEBACK	(1<<2)
 
 static ssize_t writeback_store(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t len)
@@ -669,6 +669,8 @@ static ssize_t writeback_store(struct device *dev,
 		mode = HUGE_WRITEBACK;
 	else if (sysfs_streq(buf, "huge_idle"))
 		mode = IDLE_WRITEBACK | HUGE_WRITEBACK;
+	else if (sysfs_streq(buf, "incompressible"))
+		mode = INCOMPRESSIBLE_WRITEBACK;
 	else {
 		if (strncmp(buf, PAGE_WB_SIG, sizeof(PAGE_WB_SIG) - 1))
 			return -EINVAL;
@@ -731,11 +733,15 @@ static ssize_t writeback_store(struct device *dev,
 			goto next;
 
 		if (mode & IDLE_WRITEBACK &&
-			  !zram_test_flag(zram, index, ZRAM_IDLE))
+		    !zram_test_flag(zram, index, ZRAM_IDLE))
 			goto next;
 		if (mode & HUGE_WRITEBACK &&
-			  !zram_test_flag(zram, index, ZRAM_HUGE))
+		    !zram_test_flag(zram, index, ZRAM_HUGE))
 			goto next;
+		if (mode & INCOMPRESSIBLE_WRITEBACK &&
+		    !zram_test_flag(zram, index, ZRAM_INCOMPRESSIBLE))
+			goto next;
+
 		/*
 		 * Clearing ZRAM_UNDER_WB is duty of caller.
 		 * IOW, zram_free_page never clear it.
-- 
cgit v1.2.3

From 77db7bb56bd711586243924a5582727f7a93fb7f Mon Sep 17 00:00:00 2001
From: Sergey Senozhatsky
Date: Wed, 9 Nov 2022 20:50:47 +0900
Subject: zram: add incompressible flag to read_block_state()

Add a new flag to zram block state that shows if the page is
incompressible: that none of the algorithm (including secondary ones)
could compress it.

Link: https://lkml.kernel.org/r/20221109115047.2921851-14-senozhatsky@chromium.org
Suggested-by: Minchan Kim
Signed-off-by: Sergey Senozhatsky
Acked-by: Minchan Kim
Cc: Alexey Romanov
Cc: Nhat Pham
Cc: Nitin Gupta
Cc: Suleiman Souhlal
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/blockdev/zram.rst | 11 +++++++----
 drivers/block/zram/zram_drv.c               |  6 ++++--
 2 files changed, 11 insertions(+), 6 deletions(-)

(limited to 'Documentation/admin-guide')

diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index f14c8c2e42f3..e4551579cb12 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -497,10 +497,11 @@ pages of the process with*pagemap.
 If you enable the feature, you could see block state via
 /sys/kernel/debug/zram/zram0/block_state". The output is as follows::
 
-	  300    75.033841 .wh..
-	  301    63.806904 s....
-	  302    63.806919 ..hi.
-	  303    62.801919 ....r
+	  300    75.033841 .wh...
+	  301    63.806904 s.....
+	  302    63.806919 ..hi..
+	  303    62.801919 ....r.
+	  304   146.781902 ..hi.n
 
 First column
 	zram's block index.
@@ -519,6 +520,8 @@ Third column
 		idle page
 	r:
 		recompressed page (secondary compression algorithm)
+	n:
+		none (including secondary) of algorithms could compress it
 
 First line of above example says 300th block is accessed at 75.033841sec
 and the block's state is huge so it is written back to the backing
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 25b7ff2b56bf..9d33801e8ba8 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -946,14 +946,16 @@ static ssize_t read_block_state(struct file *file, char __user *buf,
 
 		ts = ktime_to_timespec64(zram->table[index].ac_time);
 		copied = snprintf(kbuf + written, count,
-			"%12zd %12lld.%06lu %c%c%c%c%c\n",
+			"%12zd %12lld.%06lu %c%c%c%c%c%c\n",
 			index, (s64)ts.tv_sec,
 			ts.tv_nsec / NSEC_PER_USEC,
 			zram_test_flag(zram, index, ZRAM_SAME) ? 's' : '.',
 			zram_test_flag(zram, index, ZRAM_WB) ? 'w' : '.',
 			zram_test_flag(zram, index, ZRAM_HUGE) ? 'h' : '.',
 			zram_test_flag(zram, index, ZRAM_IDLE) ? 'i' : '.',
-			zram_get_priority(zram, index) ? 'r' : '.');
+			zram_get_priority(zram, index) ? 'r' : '.',
+			zram_test_flag(zram, index,
+				ZRAM_INCOMPRESSIBLE) ? 'n' : '.');
 
 		if (count <= copied) {
 			zram_slot_unlock(zram, index);
-- 
cgit v1.2.3

From 9b34a307f39497198645de5e43f3f00b5e873249 Mon Sep 17 00:00:00 2001
From: Jian Wen
Date: Fri, 11 Nov 2022 11:46:39 +0800
Subject: docs: admin-guide: cgroup-v1: update description of inactive_file

MADV_FREE pages have been moved into the LRU_INACTIVE_FILE list by commit
f7ad2a6cb9f7 ("mm: move MADV_FREE pages into LRU_INACTIVE_FILE list").

Link: https://lkml.kernel.org/r/20221111034639.3593380-1-wenjian1@xiaomi.com
Signed-off-by: Jian Wen
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/cgroup-v1/memory.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'Documentation/admin-guide')

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 5b86245450bd..60370f2c67b9 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -543,7 +543,8 @@ inactive_anon # of bytes of anonymous and swap cache memory on inactive
 		LRU list.
 active_anon	# of bytes of anonymous and swap cache memory on active
 		LRU list.
-inactive_file	# of bytes of file-backed memory on inactive LRU list.
+inactive_file	# of bytes of file-backed memory and MADV_FREE anonymous memory(
+		LazyFree pages) on inactive LRU list.
 active_file	# of bytes of file-backed memory on active LRU list.
 unevictable	# of bytes of memory that cannot be reclaimed (mlocked etc).
 =============== ===============================================================
-- 
cgit v1.2.3

From 12a5d3955227b0d7e04fb793ccceeb2a1dd275c5 Mon Sep 17 00:00:00 2001
From: Mina Almasry
Date: Fri, 2 Dec 2022 14:35:31 -0800
Subject: mm: add nodes= arg to memory.reclaim

The nodes= arg instructs the kernel to only scan the given nodes for
proactive reclaim. For example use cases, consider a 2 tier memory
system:

nodes 0,1 -> top tier
nodes 2,3 -> second tier

$ echo "1m nodes=0" > memory.reclaim

This instructs the kernel to attempt to reclaim 1m memory from node 0.
Since node 0 is a top tier node, demotion will be attempted first. This
is useful to direct proactive reclaim to specific nodes that are under
pressure.

$ echo "1m nodes=2,3" > memory.reclaim

This instructs the kernel to attempt to reclaim 1m memory in the second
tier, since this tier of memory has no demotion targets the memory will be
reclaimed.

$ echo "1m nodes=0,1" > memory.reclaim

Instructs the kernel to reclaim memory from the top tier nodes, which can
be desirable according to the userspace policy if there is pressure on the
top tiers. Since these nodes have demotion targets, the kernel will
attempt demotion first.

Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
reclaim""), the proactive reclaim interface memory.reclaim does both
reclaim and demotion. Reclaim and demotion incur different latency costs
to the jobs in the cgroup. Demoted memory would still be addressable by
the userspace at a higher latency, but reclaimed memory would need to
incur a pagefault.

The 'nodes' arg is useful to allow the userspace to control demotion and
reclaim independently according to its policy: if the memory.reclaim is
called on a node with demotion targets, it will attempt demotion first; if
it is called on a node without demotion targets, it will only attempt
reclaim.

Link: https://lkml.kernel.org/r/20221202223533.1785418-1-almasrymina@google.com
Signed-off-by: Mina Almasry
Acked-by: Michal Hocko
Acked-by: Shakeel Butt
Acked-by: Muchun Song
Cc: Bagas Sanjaya
Cc: "Huang, Ying"
Cc: Johannes Weiner
Cc: Jonathan Corbet
Cc: Roman Gushchin
Cc: Tejun Heo
Cc: Wei Xu
Cc: Yang Shi
Cc: Yosry Ahmed
Cc: zefan li
Signed-off-by: Andrew Morton
---
 Documentation/admin-guide/cgroup-v2.rst | 15 +++++---
 include/linux/swap.h                    |  3 +-
 mm/memcontrol.c                         | 67 ++++++++++++++++++++++++++-------
 mm/vmscan.c                             |  4 +-
 4 files changed, 68 insertions(+), 21 deletions(-)

(limited to 'Documentation/admin-guide')

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 74cec76be9f2..c8ae7c897f14 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1245,17 +1245,13 @@ PAGE_SIZE multiple when read back.
 	This is a simple interface to trigger memory reclaim in the
 	target cgroup.
 
-	This file accepts a single key, the number of bytes to reclaim.
-	No nested keys are currently supported.
+	This file accepts a string which contains the number of bytes to
+	reclaim.
 
 	Example::
 
 	  echo "1G" > memory.reclaim
 
-	The interface can be later extended with nested keys to
-	configure the reclaim behavior. For example, specify the
-	type of memory to reclaim from (anon, file, ..).
-
 	Please note that the kernel can over or under reclaim from
 	the target cgroup. If less bytes are reclaimed than the
 	specified amount, -EAGAIN is returned.
@@ -1267,6 +1263,13 @@ PAGE_SIZE multiple when read back.
 	This means that the networking layer will not adapt based on
 	reclaim induced by memory.reclaim.
 
+	This file also allows the user to specify the nodes to reclaim from,
+	via the 'nodes=' key, for example::
+
+	  echo "1G nodes=0,1" > memory.reclaim
+
+	The above instructs the kernel to reclaim memory from nodes 0,1.
+
   memory.peak
 	A read-only single value file which exists on non-root
 	cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ceed49516ad..2787b84eaf12 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -418,7 +418,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
-						  unsigned int reclaim_options);
+						  unsigned int reclaim_options,
+						  nodemask_t *nodemask);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2c7a91689fef..ff65bc23be13 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
 #include
 #include
 #include
+#include
 #include "internal.h"
 #include
 #include
@@ -2392,7 +2393,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							gfp_mask,
-							MEMCG_RECLAIM_MAY_SWAP);
+							MEMCG_RECLAIM_MAY_SWAP,
+							NULL);
 		psi_memstall_leave(&pflags);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2683,7 +2685,8 @@ retry:
 
 	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, reclaim_options);
+						    gfp_mask, reclaim_options,
+						    NULL);
 	psi_memstall_leave(&pflags);
 
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3503,7 +3506,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
+					NULL)) {
 			ret = -EBUSY;
 			break;
 		}
@@ -3614,7 +3618,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP))
+						  MEMCG_RECLAIM_MAY_SWAP,
+						  NULL))
 			nr_retries--;
 	}
 
@@ -6418,7 +6423,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 		}
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+					NULL);
 
 		if (!reclaimed && !nr_retries--)
 			break;
@@ -6467,7 +6473,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 
 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+					NULL))
 				nr_reclaims--;
 			continue;
 		}
@@ -6590,21 +6597,54 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+enum {
+	MEMORY_RECLAIM_NODES = 0,
+	MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t if_tokens = {
+	{ MEMORY_RECLAIM_NODES, "nodes=%s" },
+	{ MEMORY_RECLAIM_NULL, NULL },
+};
+
 static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			      size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
-	unsigned int reclaim_options;
-	int err;
+	unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
+				       MEMCG_RECLAIM_PROACTIVE;
+	char *old_buf, *start;
+	substring_t args[MAX_OPT_ARGS];
+	int token;
+	char value[256];
+	nodemask_t nodemask = NODE_MASK_ALL;
 
 	buf = strstrip(buf);
-	err = page_counter_memparse(buf, "", &nr_to_reclaim);
-	if (err)
-		return err;
 
-	reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
+	old_buf = buf;
+	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+	if (buf == old_buf)
+		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	while ((start = strsep(&buf, " ")) != NULL) {
+		if (!strlen(start))
+			continue;
+		token = match_token(start, if_tokens, args);
+		match_strlcpy(value, args, sizeof(value));
+		switch (token) {
+		case MEMORY_RECLAIM_NODES:
+			if (nodelist_parse(value, nodemask) < 0)
+				return -EINVAL;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+
 	while (nr_reclaimed < nr_to_reclaim) {
 		unsigned long reclaimed;
 
@@ -6621,7 +6661,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
 						nr_to_reclaim - nr_reclaimed,
-						GFP_KERNEL, reclaim_options);
+						GFP_KERNEL, reclaim_options,
+						&nodemask);
 
 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1a59171c6695..2b42ac9ad755 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6758,7 +6758,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
-					   unsigned int reclaim_options)
+					   unsigned int reclaim_options,
+					   nodemask_t *nodemask)
 {
 	unsigned long nr_reclaimed;
 	unsigned int noreclaim_flag;
@@ -6773,6 +6774,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
 		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+		.nodemask = nodemask,
 	};
 	/*
 	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
-- cgit v1.2.3
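
Note on the memory.reclaim input syntax: the patch above parses a size followed by optional space-separated keys, of which only 'nodes=' is recognized. The following is a minimal userspace sketch in C of that parsing flow, for experimenting with the accepted strings outside the kernel. It is an approximation under stated assumptions, not the kernel code: parse_size() is a rough stand-in for memparse(), parse_node_list() for nodelist_parse(), and struct reclaim_request and parse_reclaim_request() are hypothetical names introduced here. The sketch also limits itself to nodes 0..63 and keeps the amount in bytes instead of dividing by PAGE_SIZE.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type; the kernel keeps nr_to_reclaim and a nodemask_t. */
struct reclaim_request {
	uint64_t bytes;		/* requested amount, in bytes */
	uint64_t node_mask;	/* bit N set => node N selected */
};

/* Rough stand-in for memparse(): a number plus an optional K/M/G suffix. */
static uint64_t parse_size(char *s, char **end)
{
	uint64_t val = strtoull(s, end, 0);

	switch (**end) {
	case 'G': case 'g':
		val <<= 10;
		/* fall through */
	case 'M': case 'm':
		val <<= 10;
		/* fall through */
	case 'K': case 'k':
		val <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return val;
}

/* Parse a list such as "0,2-3" into a bitmask (assumption: nodes 0..63 only). */
static int parse_node_list(const char *p, uint64_t *mask)
{
	uint64_t out = 0;

	if (*p == '\0')
		return -EINVAL;
	while (*p) {
		char *end;
		unsigned long lo, hi;

		if (*p < '0' || *p > '9')
			return -EINVAL;
		lo = hi = strtoul(p, &end, 10);
		if (*end == '-') {
			p = end + 1;
			if (*p < '0' || *p > '9')
				return -EINVAL;
			hi = strtoul(p, &end, 10);
		}
		if (lo > hi || hi > 63)
			return -EINVAL;
		for (unsigned long n = lo; n <= hi; n++)
			out |= UINT64_C(1) << n;
		if (*end == ',') {
			end++;
			if (*end == '\0')	/* reject a trailing comma */
				return -EINVAL;
		} else if (*end != '\0') {
			return -EINVAL;
		}
		p = end;
	}
	*mask = out;
	return 0;
}

/* Parse "<size> [nodes=<list>]"; returns 0 on success or -EINVAL. */
int parse_reclaim_request(const char *input, struct reclaim_request *req)
{
	char buf[256];
	char *end, *tok;

	assert(input != NULL && req != NULL);
	if (strlen(input) >= sizeof(buf))
		return -EINVAL;
	strcpy(buf, input);

	req->bytes = parse_size(buf, &end);
	if (end == buf)			/* no size was consumed */
		return -EINVAL;
	req->node_mask = UINT64_MAX;	/* default: all nodes */

	/* strtok() skips empty tokens, like the strlen() check in the patch. */
	for (tok = strtok(end, " \t\n"); tok; tok = strtok(NULL, " \t\n")) {
		if (strncmp(tok, "nodes=", 6) != 0)
			return -EINVAL;	/* unknown key */
		if (parse_node_list(tok + 6, &req->node_mask) < 0)
			return -EINVAL;
	}
	return 0;
}
```

As in the patch, a string with no leading size or with an unrecognized key is rejected with -EINVAL, and omitting 'nodes=' leaves every node selected.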