summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorRoman Gushchin <guro@fb.com>2020-08-06 23:20:32 -0700
committerLinus Torvalds <torvalds@linux-foundation.org>2020-08-07 11:33:24 -0700
commiteedc4e5a142cc33fbb54f8d72b929a0e123c48c4 (patch)
treeaec63d54757fb9a1c1eedf5ff0b1235a46db58bc
parentd648bcc7fe65f09ecd19091f68395dfb3b7a87c8 (diff)
downloadlinux-eedc4e5a142cc33fbb54f8d72b929a0e123c48c4.tar.bz2
mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
Patch series "The new cgroup slab memory controller", v7. The patchset moves the accounting from the page level to the object level. It allows to share slab pages between memory cgroups. This leads to a significant win in the slab utilization (up to 45%) and the corresponding drop in the total kernel memory footprint. The reduced number of unmovable slab pages should also have a positive effect on the memory fragmentation. The patchset makes the slab accounting code simpler: there is no more need in the complicated dynamic creation and destruction of per-cgroup slab caches, all memory cgroups use a global set of shared slab caches. The lifetime of slab caches is not more connected to the lifetime of memory cgroups. The more precise accounting does require more CPU, however in practice the difference seems to be negligible. We've been using the new slab controller in Facebook production for several months with different workloads and haven't seen any noticeable regressions. What we've seen were memory savings in order of 1 GB per host (it varied heavily depending on the actual workload, size of RAM, number of CPUs, memory pressure, etc). The third version of the patchset added yet another step towards the simplification of the code: sharing of slab caches between accounted and non-accounted allocations. It comes with significant upsides (most noticeable, a complete elimination of dynamic slab caches creation) but not without some regression risks, so this change sits on top of the patchset and is not completely merged in. So in the unlikely event of a noticeable performance regression it can be reverted separately. The slab memory accounting works in exactly the same way for SLAB and SLUB. With both allocators the new controller shows significant memory savings, with SLUB the difference is bigger. On my 16-core desktop machine running Fedora 32 the size of the slab memory measured after the start of the system was lower by 58% and 38% with SLUB and SLAB correspondingly. As an estimation of a potential CPU overhead, below are results of slab_bulk_test01 test, kindly provided by Jesper D. Brouer. He also helped with the evaluation of results. The test can be found here: https://github.com/netoptimizer/prototype-kernel/ The smallest number in each row should be picked for a comparison. SLUB-patched - bulk-API - SLUB-patched : bulk_quick_reuse objects=1 : 187 - 90 - 224 cycles(tsc) - SLUB-patched : bulk_quick_reuse objects=2 : 110 - 53 - 133 cycles(tsc) - SLUB-patched : bulk_quick_reuse objects=3 : 88 - 95 - 42 cycles(tsc) - SLUB-patched : bulk_quick_reuse objects=4 : 91 - 85 - 36 cycles(tsc) - SLUB-patched : bulk_quick_reuse objects=8 : 32 - 66 - 32 cycles(tsc) SLUB-original - bulk-API - SLUB-original: bulk_quick_reuse objects=1 : 87 - 87 - 142 cycles(tsc) - SLUB-original: bulk_quick_reuse objects=2 : 52 - 53 - 53 cycles(tsc) - SLUB-original: bulk_quick_reuse objects=3 : 42 - 42 - 91 cycles(tsc) - SLUB-original: bulk_quick_reuse objects=4 : 91 - 37 - 37 cycles(tsc) - SLUB-original: bulk_quick_reuse objects=8 : 31 - 79 - 76 cycles(tsc) SLAB-patched - bulk-API - SLAB-patched : bulk_quick_reuse objects=1 : 67 - 67 - 140 cycles(tsc) - SLAB-patched : bulk_quick_reuse objects=2 : 55 - 46 - 46 cycles(tsc) - SLAB-patched : bulk_quick_reuse objects=3 : 93 - 94 - 39 cycles(tsc) - SLAB-patched : bulk_quick_reuse objects=4 : 35 - 88 - 85 cycles(tsc) - SLAB-patched : bulk_quick_reuse objects=8 : 30 - 30 - 30 cycles(tsc) SLAB-original- bulk-API - SLAB-original: bulk_quick_reuse objects=1 : 143 - 136 - 67 cycles(tsc) - SLAB-original: bulk_quick_reuse objects=2 : 45 - 46 - 46 cycles(tsc) - SLAB-original: bulk_quick_reuse objects=3 : 38 - 39 - 39 cycles(tsc) - SLAB-original: bulk_quick_reuse objects=4 : 35 - 87 - 87 cycles(tsc) - SLAB-original: bulk_quick_reuse objects=8 : 29 - 66 - 30 cycles(tsc) This patch (of 19): To convert memcg and lruvec slab counters to bytes there must be a way to change these counters without touching node counters. Factor out __mod_memcg_lruvec_state() out of __mod_lruvec_state(). Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-1-guro@fb.com Link: http://lkml.kernel.org/r/20200623174037.3951353-2-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-rw-r--r--include/linux/memcontrol.h17
-rw-r--r--mm/memcontrol.c43
2 files changed, 41 insertions, 19 deletions
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e77197a62809..b250f8197710 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -679,11 +679,23 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
return x;
}
+void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+ int val);
void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
int val);
void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val);
void mod_memcg_obj_state(void *p, int idx, int val);
+static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
+ enum node_stat_item idx, int val)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __mod_memcg_lruvec_state(lruvec, idx, val);
+ local_irq_restore(flags);
+}
+
static inline void mod_lruvec_state(struct lruvec *lruvec,
enum node_stat_item idx, int val)
{
@@ -1057,6 +1069,11 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
return node_page_state(lruvec_pgdat(lruvec), idx);
}
+static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
+ enum node_stat_item idx, int val)
+{
+}
+
static inline void __mod_lruvec_state(struct lruvec *lruvec,
enum node_stat_item idx, int val)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 24892a14cc75..5863ceb310fb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -713,30 +713,13 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
return mem_cgroup_nodeinfo(parent, nid);
}
-/**
- * __mod_lruvec_state - update lruvec memory statistics
- * @lruvec: the lruvec
- * @idx: the stat item
- * @val: delta to add to the counter, can be negative
- *
- * The lruvec is the intersection of the NUMA node and a cgroup. This
- * function updates the all three counters that are affected by a
- * change of state at this level: per-node, per-cgroup, per-lruvec.
- */
-void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
- int val)
+void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+ int val)
{
- pg_data_t *pgdat = lruvec_pgdat(lruvec);
struct mem_cgroup_per_node *pn;
struct mem_cgroup *memcg;
long x;
- /* Update node */
- __mod_node_page_state(pgdat, idx, val);
-
- if (mem_cgroup_disabled())
- return;
-
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
memcg = pn->memcg;
@@ -748,6 +731,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+ pg_data_t *pgdat = lruvec_pgdat(lruvec);
struct mem_cgroup_per_node *pi;
for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
@@ -757,6 +741,27 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
__this_cpu_write(pn->lruvec_stat_cpu->count[idx], x);
}
+/**
+ * __mod_lruvec_state - update lruvec memory statistics
+ * @lruvec: the lruvec
+ * @idx: the stat item
+ * @val: delta to add to the counter, can be negative
+ *
+ * The lruvec is the intersection of the NUMA node and a cgroup. This
+ * function updates the all three counters that are affected by a
+ * change of state at this level: per-node, per-cgroup, per-lruvec.
+ */
+void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
+ int val)
+{
+ /* Update node */
+ __mod_node_page_state(lruvec_pgdat(lruvec), idx, val);
+
+ /* Update memcg and lruvec */
+ if (!mem_cgroup_disabled())
+ __mod_memcg_lruvec_state(lruvec, idx, val);
+}
+
void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
{
pg_data_t *pgdat = page_pgdat(virt_to_page(p));