net: tcp: add key management to congestion control

This patch adds necessary infrastructure to the congestion control framework for later per route congestion control support. For a per route congestion control possibility, our aim is to store a unique u32 key identifier into dst metrics, which can then be mapped into a tcp_congestion_ops struct. We argue that having a RTAX key entry is the most simple, generic and easy way to manage, and also keeps the memory footprint of dst entries lower on 64 bit than with storing a pointer directly, for example. Having a unique key id also allows for decoupling actual TCP congestion control module management from the FIB layer, i.e. we don't have to care about expensive module refcounting inside the FIB at this point. We first thought of using an IDR store for the realization, which takes over dynamic assignment of unused key space and also performs the key to pointer mapping in RCU. While doing so, we stumbled upon the issue that due to the nature of dynamic key distribution, it just so happens, arguably in very rare occasions, that excessive module loads and unloads can lead to a possible reuse of previously used key space. Thus, previously stale keys in the dst metric are now being reassigned to a different congestion control algorithm, which might lead to unexpected behaviour. One way to resolve this would have been to walk FIBs on the actually rare occasion of a module unload and reset the metric keys for each FIB in each netns, but that's just very costly. Therefore, we argue a better solution is to reuse the unique congestion control algorithm name member and map that into u32 key space through jhash. For that, we split the flags attribute (as it currently uses 2 bits only anyway) into two u32 attributes, flags and key, so that we can keep the cacheline boundary of 2 cachelines on x86_64 and cache the precalculated key at registration time for the fast path. On average we might expect 2 - 4 modules being loaded worst case perhaps 15, so a key collision possibility is extremely low, and guaranteed collision-free on LE/BE for all in-tree modules. Overall this results in much simpler code, and all without the overhead of an IDR. Due to the deterministic nature, modules can now be unloaded, the congestion control algorithm for a specific but unloaded key will fall back to the default one, and on module reload time it will switch back to the expected algorithm transparently. Joint work with Florian Westphal. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
author: Daniel Borkmann <dborkman@redhat.com> 2015-01-05 23:57:46 +0100
committer: David S. Miller <davem@davemloft.net> 2015-01-05 22:55:24 -0500
commit: c5c6a8ab45ec0f18733afb4aaade0d4a139d80b3 (patch)
tree: c248b79eca4c665244c4280dbde577cb71d001df /include
parent: 29ba4fffd396f4beefe34d24d7bcd86cb5c3e492 (diff)
download: linux-c5c6a8ab45ec0f18733afb4aaade0d4a139d80b3.tar.bz2
2 files changed, 10 insertions, 2 deletions
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 848e85cb5c61..5976bdecf58b 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -98,7 +98,8 @@ struct inet_connection_sock {
 	const struct tcp_congestion_ops *icsk_ca_ops;
 	const struct inet_connection_sock_af_ops *icsk_af_ops;
 	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
-	__u8			  icsk_ca_state;
+	__u8			  icsk_ca_state:7,
+				  icsk_ca_dst_locked:1;
 	__u8			  icsk_retransmits;
 	__u8			  icsk_pending;
 	__u8			  icsk_backoff;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f50f29faf76f..135b70c9a734 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -787,6 +787,8 @@ enum tcp_ca_ack_event_flags {
 #define TCP_CA_MAX	128
 #define TCP_CA_BUF_MAX	(TCP_CA_NAME_MAX*TCP_CA_MAX)
 
+#define TCP_CA_UNSPEC	0
+
 /* Algorithm can be set on socket without CAP_NET_ADMIN privileges */
 #define TCP_CONG_NON_RESTRICTED 0x1
 /* Requires ECN/ECT set on all packets */
@@ -794,7 +796,8 @@ enum tcp_ca_ack_event_flags {
 
 struct tcp_congestion_ops {
 	struct list_head	list;
-	unsigned long flags;
+	u32 key;
+	u32 flags;
 
 	/* initialize private data (optional) */
 	void (*init)(struct sock *sk);
@@ -841,6 +844,10 @@ u32 tcp_reno_ssthresh(struct sock *sk);
 void tcp_reno_cong_avoid(struct sock *sk, u32 ack, u32 acked);
 extern struct tcp_congestion_ops tcp_reno;
 
+struct tcp_congestion_ops *tcp_ca_find_key(u32 key);
+u32 tcp_ca_get_key_by_name(const char *name);
+char *tcp_ca_get_name_by_key(u32 key, char *buffer);
+
 static inline bool tcp_ca_needs_ecn(const struct sock *sk)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
author	Daniel Borkmann <dborkman@redhat.com>	2015-01-05 23:57:46 +0100
committer	David S. Miller <davem@davemloft.net>	2015-01-05 22:55:24 -0500
commit	c5c6a8ab45ec0f18733afb4aaade0d4a139d80b3 (patch)
tree	c248b79eca4c665244c4280dbde577cb71d001df /include
parent	29ba4fffd396f4beefe34d24d7bcd86cb5c3e492 (diff)
download	linux-c5c6a8ab45ec0f18733afb4aaade0d4a139d80b3.tar.bz2