author | David S. Miller <davem@davemloft.net> | 2016-03-09 16:36:16 -0500 |
---|---|---|
committer | David S. Miller <davem@davemloft.net> | 2016-03-09 16:36:16 -0500 |
commit | 9531ab65f4ec066a6e6617a08a293c60397a161b (patch) | |
tree | 18b025fb9daf230bf9d0be894c24aab69361748f | |
parent | 26e9093110fb9ceb10093e4914b129b58d49a425 (diff) | |
parent | 10016594f4c6b3ef34c5de97d8ab62205d9d26a5 (diff) | |
download | linux-9531ab65f4ec066a6e6617a08a293c60397a161b.tar.bz2 |
Merge branch 'kcm'
Tom Herbert says:
====================
kcm: Kernel Connection Multiplexor (KCM)
Kernel Connection Multiplexor (KCM) is a facility that provides a
message based interface over TCP for generic application protocols.
The motivation for this is based on the observation that although
TCP is a byte stream transport protocol with no concept of message
boundaries, a common use case is to implement a framed application
layer protocol running over TCP. To date, most TCP stacks offer a
byte stream API for applications, which places the burden of message
delineation, message I/O operation atomicity, and load balancing
in the application. With KCM an application can efficiently send
and receive application protocol messages over TCP using a
datagram interface.
In order to delineate messages in a TCP stream for receive in KCM, the
kernel implements a message parser. For this we chose to employ BPF
which is applied to the TCP stream. BPF code parses application layer
messages and returns a message length. Nearly all binary application
protocols are parsable in this manner, so KCM should be applicable
across a wide range of applications. Other than message length
determination in receive, KCM does not require any other application
specific awareness. KCM does not implement any other application
protocol semantics-- these are provided in userspace or could be
implemented in a kernel module layered above KCM.
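As a rough userspace illustration of what such a parser computes (this is
plain C, not a BPF program, and the framing is an assumption: a 4-byte
big-endian length prefix followed by the payload; frame_length is a
hypothetical name, and returning 0 for "need more bytes" is only this
sketch's convention), the length determination might look like:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the total framed message length (4-byte
 * header plus payload) for an assumed big-endian length prefix, or 0 when
 * fewer than 4 bytes are available and more data is needed.
 */
static uint64_t frame_length(const unsigned char *buf, size_t avail)
{
	uint32_t payload;

	if (avail < 4)
		return 0;

	/* Assemble the big-endian payload length byte by byte. */
	payload = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
		  ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

	/* Widen before adding the header size so the sum cannot wrap. */
	return (uint64_t)payload + 4;
}
```

The parser only needs to look at the header; everything after the length
computation is left to KCM or the application.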
KCM implements an NxM multiplexor in the kernel as diagrammed below:
+------------+   +------------+   +------------+   +------------+
| KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
+------------+   +------------+   +------------+   +------------+
      |                 |               |                |
      +-----------+     |               |     +----------+
                  |     |               |     |
               +----------------------------------+
               |           Multiplexor            |
               +----------------------------------+
                 |   |           |           |  |
       +---------+   |           |           |  ------------+
       |             |           |           |              |
+----------+  +----------+  +----------+  +----------+ +----------+
|  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
+----------+  +----------+  +----------+  +----------+ +----------+
      |              |           |            |             |
+----------+  +----------+  +----------+  +----------+ +----------+
| TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
+----------+  +----------+  +----------+  +----------+ +----------+
The KCM sockets provide the datagram interface to applications,
Psocks are the state for each attached TCP connection (i.e. where
message delineation is performed on receive).
A description of the APIs and design can be found in the included
Documentation/networking/kcm.txt.
In this patch set:
- Add MSG_BATCH flag. This is used in sendmsg msg_hdr flags to
indicate that more messages will be sent on the socket. The stack
may batch messages up if it is beneficial for transmission.
- In sendmmsg, set MSG_BATCH in all sub messages except for the last
one.
- In order to allow sendmmsg to contain multiple messages with
SOCK_SEQPACKET we allow each msg_hdr in the sendmmsg to set MSG_EOR.
- Add KCM module
- This supports SOCK_DGRAM and SOCK_SEQPACKET.
- KCM documentation
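The MSG_BATCH rule described in the list above (every message in a group
except the last carries the flag) can be sketched in C. This is only an
illustration: fill_batch_flags is a hypothetical helper, and the fallback
value 0x40000 is the one this patch set adds to include/linux/socket.h, used
here in case the C library headers do not define MSG_BATCH.

```c
#include <stddef.h>

#ifndef MSG_BATCH
#define MSG_BATCH 0x40000 /* value added to include/linux/socket.h here */
#endif

/* Hypothetical helper: compute per-message send flags for a group of n
 * messages so that all but the last one are marked with MSG_BATCH.
 */
static void fill_batch_flags(int *flags, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		flags[i] = (i + 1 < n) ? MSG_BATCH : 0;
}
```

A caller would pass flags[i] to the i-th sendmsg call of the group; with
sendmmsg the stack applies the same rule to the sub messages itself.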
v2:
- Added splice and page operations.
- Assemble receive messages in place on TCP socket (don't have a
separate assembly queue).
- Based on above, enforce maximum receive message to be the size
of the receive socket buffer.
- Support message assembly timeout. Use the timeout value in
sk_rcvtimeo on the TCP socket.
- Tested some with a couple of other production applications,
see ~5% improvement in application latency.
Testing:
Dave Watson has integrated KCM into Thrift and we intend to put these
changes into open source. Example of this is in:
https://github.com/djwatson/fbthrift/commit/dd7e0f9cf4e80912fdb90f6cd394db24e61a14cc
Some initial KCM Thrift benchmark numbers (comment from Dave)
Thrift by default ties a single connection to a single thread. KCM is
instead able to load balance multiple connections across multiple epoll
loops easily.
A test sending ~5k bytes of data to a kcm thrift server, dropping the
bytes on recv:
               QPS       Latency / std dev Latency
without KCM    70336     209/123
with KCM       70353     191/124
A test sending a small request, then doing work in the epoll thread,
before serving more requests:
               QPS       Latency / std dev Latency
without KCM    14282     559/602
with KCM       23192     344/234
At the high end, there's definitely some additional kernel overhead:
Cranking the pipelining way up, with lots of small requests
               QPS       Latency / std dev Latency
without KCM    1863429   127/119
with KCM       1337713   192/241
---
So for a "realistic" workload, KCM performs pretty well (second case).
Under extreme conditions of highest tps we still have some work to do.
In its nature a multiplexor will spread work between CPUs which is
logically good for load balancing but can conflict with the goal of
promoting affinity. Batching messages on both send and receive is
the means to recoup performance.
Future support:
- Integration with TLS (TLS-in-kernel is a separate initiative).
- Page operations/splice support
- Unconnected KCM sockets. Will be able to attach sockets to different
destinations, AF_KCM addresses will be used in sendmsg and recvmsg
to indicate destination
- Explore more utility in performing BPF inline with a TCP data stream
(setting SO_MARK, rxhash for messages being sent or received on
KCM sockets).
- Performance work
- Diagnose performance issues under high message load
FAQ (Questions posted on LWN)
Q: Why do this in the kernel?
A: Because the kernel is good at scheduling threads and steering packets
to threads. KCM fits well into this model since it allows the unit
of work for scheduling and steering to be the application layer
messages themselves. KCM should be thought of as generic application
protocol acceleration. It adheres to the philosophy that the kernel provides
generic and extensible interfaces.
Q: How can adding code in the path yield better performance?
A: It is true that for just sending or receiving a single message there
would be some performance loss since the code path is longer (for
instance comparing netperf to KCM). But for real production
applications performance takes on many dynamics. Parallelism, context
switching, affinity, granularity of locking, and load balancing are
all relevant. The theory of KCM is that by providing an application-centric
interface, the kernel can provide better support for these
performance characteristics.
Q: Why not use an existing message-oriented protocol such as RUDP,
DCCP, SCTP, RDS, and others?
A: Because that would entail using a completely new transport protocol.
Deploying a new protocol at scale is either a huge undertaking or
fundamentally infeasible. This is true in both the Internet and
the data center due in large part to protocol ossification.
Besides, we want KCM to work with existing, well deployed application
protocols that we couldn't change even if we wanted to (e.g. http/2).
KCM simply defines a new interface method, it does not redefine any
aspect of the transport protocol nor application protocol, nor set
any new requirements on these. Neither does KCM attempt to implement
any application protocol logic other than message delineation in the
stream. These are fundamental requirements of KCM.
Q: How does this affect TCP?
A: It doesn't, not in the slightest. The use of KCM can be one-sided,
KCM has no effect on the wire.
Q: Why force TCP into doing something it's not designed for?
A: TCP is defined as a transport protocol and there is no standard that
says the API into TCP must be stream based sockets, or for that
matter sockets at all (or even that TCP needs to be implemented in a
kernel). KCM is not inconsistent with the design of TCP just because
it makes a message based interface over TCP, if it were then every
application protocol sending messages over TCP would also be! :-)
Q: What about the problem of connections with a very slow rate of
incoming data? As a result your application can get storms of very
short reads. And it actually happens a lot with connections from
mobile devices and it is a problem for servers handling a lot of
connections.
A: The storm of short reads will occur regardless of whether KCM is used
or not. KCM does have one advantage in this scenario though, it will
only wake up the application when a full message has been received,
not for each packet that makes up part of a bigger message. If a
bunch of small messages are received, the application can receive
messages in batches using recvmmsg.
Q: Why not just use DPDK, or at least provide KCM like functionality in
DPDK?
A: DPDK, or more generally OS bypass presumably with a TCP stack in
userland, presents a different model of load balancing than that of
KCM (and the kernel). KCM implements load balancing of messages
across the threads of an application, whereas DPDK load balances
based on queues which are more static and coarse-grained since
multiple connections are bound to queues. DPDK works best when
processing of packets is silo'ed in a thread on the CPU processing
a queue, and packet processing (for both the stack and application)
is fairly uniform. KCM works well for applications where the amount
of work to process messages varies and application work is commonly
delegated to worker threads often on different CPUs.
The message based interface over TCP is something that could be
provided by a DPDK or OS bypass library.
Q: I'm not quite seeing this for HTTP. Maybe for HTTP/2, I guess, or web
sockets?
A: Yes. KCM is most appropriate for message based protocols over TCP
where it is easy to deduce the message length (e.g. a length field)
and the protocol implements its own message ordering semantics.
Fortunately this encompasses many modern protocols.
Q: How is memory limited and controlled?
A: In v2 all data for messages is now kept in socket buffers, either
those for TCP or KCM, so socket buffer limits are applicable.
This includes receive message assembly which is now done on the
TCP socket buffer instead of a separate queue-- this has the
consequence that the TCP socket buffer limit provides an
enforceable maximum message size.
Additionally, a timeout may be set for message assembly. The
value used for this is taken from sk_rcvtimeo of the TCP socket.
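Since both limits come from ordinary socket options on the TCP socket, an
application could set them before attaching the socket to KCM. The sketch
below is an assumption about how that might be done with standard
setsockopt calls; set_assembly_limits is a hypothetical helper name and is
not part of this patch set.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: bound the message size (via SO_RCVBUF) and the
 * message assembly time (via SO_RCVTIMEO) on a TCP socket that is about
 * to be attached to a KCM multiplexor. Returns 0 on success, or -1 with
 * errno set by setsockopt() on failure.
 */
static int set_assembly_limits(int tcpfd, int rcvbuf_bytes, long timeout_sec)
{
	struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

	if (setsockopt(tcpfd, SOL_SOCKET, SO_RCVBUF, &rcvbuf_bytes,
		       sizeof(rcvbuf_bytes)) < 0)
		return -1;

	return setsockopt(tcpfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```

Note that Linux may adjust the requested SO_RCVBUF value, so the effective
limit should be read back with getsockopt if it matters to the application.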
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
-rw-r--r-- | Documentation/networking/kcm.txt | 285 |
-rw-r--r-- | include/linux/net.h | 1 |
-rw-r--r-- | include/linux/rculist.h | 21 |
-rw-r--r-- | include/linux/socket.h | 7 |
-rw-r--r-- | include/net/kcm.h | 226 |
-rw-r--r-- | include/net/tcp.h | 24 |
-rw-r--r-- | include/uapi/linux/kcm.h | 40 |
-rw-r--r-- | net/Kconfig | 1 |
-rw-r--r-- | net/Makefile | 1 |
-rw-r--r-- | net/core/skbuff.c | 39 |
-rw-r--r-- | net/ipv4/tcp.c | 15 |
-rw-r--r-- | net/kcm/Kconfig | 10 |
-rw-r--r-- | net/kcm/Makefile | 3 |
-rw-r--r-- | net/kcm/kcmproc.c | 426 |
-rw-r--r-- | net/kcm/kcmsock.c | 2409 |
-rw-r--r-- | net/socket.c | 18 |
16 files changed, 3483 insertions, 43 deletions
diff --git a/Documentation/networking/kcm.txt b/Documentation/networking/kcm.txt new file mode 100644 index 000000000000..3476ede5bc2c --- /dev/null +++ b/Documentation/networking/kcm.txt @@ -0,0 +1,285 @@ +Kernel Connection Mulitplexor +----------------------------- + +Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based +interface over TCP for generic application protocols. With KCM an application +can efficiently send and receive application protocol messages over TCP using +datagram sockets. + +KCM implements an NxM multiplexor in the kernel as diagrammed below: + ++------------+ +------------+ +------------+ +------------+ +| KCM socket | | KCM socket | | KCM socket | | KCM socket | ++------------+ +------------+ +------------+ +------------+ + | | | | + +-----------+ | | +----------+ + | | | | + +----------------------------------+ + | Multiplexor | + +----------------------------------+ + | | | | | + +---------+ | | | ------------+ + | | | | | ++----------+ +----------+ +----------+ +----------+ +----------+ +| Psock | | Psock | | Psock | | Psock | | Psock | ++----------+ +----------+ +----------+ +----------+ +----------+ + | | | | | ++----------+ +----------+ +----------+ +----------+ +----------+ +| TCP sock | | TCP sock | | TCP sock | | TCP sock | | TCP sock | ++----------+ +----------+ +----------+ +----------+ +----------+ + +KCM sockets +----------- + +The KCM sockets provide the user interface to the muliplexor. All the KCM sockets +bound to a multiplexor are considered to have equivalent function, and I/O +operations in different sockets may be done in parallel without the need for +synchronization between threads in userspace. + +Multiplexor +----------- + +The multiplexor provides the message steering. In the transmit path, messages +written on a KCM socket are sent atomically on an appropriate TCP socket. 
+Similarly, in the receive path, messages are constructed on each TCP socket +(Psock) and complete messages are steered to a KCM socket. + +TCP sockets & Psocks +-------------------- + +TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated +for each bound TCP socket, this structure holds the state for constructing +messages on receive as well as other connection specific information for KCM. + +Connected mode semantics +------------------------ + +Each multiplexor assumes that all attached TCP connections are to the same +destination and can use the different connections for load balancing when +transmitting. The normal send and recv calls (include sendmmsg and recvmmsg) +can be used to send and receive messages from the KCM socket. + +Socket types +------------ + +KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types. + +Message delineation +------------------- + +Messages are sent over a TCP stream with some application protocol message +format that typically includes a header which frames the messages. The length +of a received message can be deduced from the application protocol header +(often just a simple length field). + +A TCP stream must be parsed to determine message boundaries. Berkeley Packet +Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a +BPF program must be specified. The program is called at the start of receiving +a new message and is given an skbuff that contains the bytes received so far. +It parses the message header and returns the length of the message. Given this +information, KCM will construct the message of the stated length and deliver it +to a KCM socket. + +TCP socket management +--------------------- + +When a TCP socket is attached to a KCM multiplexor data ready (POLLIN) and +write space available (POLLOUT) events are handled by the multiplexor. 
If there +is a state change (disconnection) or other error on a TCP socket, an error is +posted on the TCP socket so that a POLLERR event happens and KCM discontinues +using the socket. When the application gets the error notification for a +TCP socket, it should unattach the socket from KCM and then handle the error +condition (the typical response is to close the socket and create a new +connection if necessary). + +KCM limits the maximum receive message size to be the size of the receive +socket buffer on the attached TCP socket (the socket buffer size can be set by +SO_RCVBUF). If the length of a new message reported by the BPF program is +greater than this limit a corresponding error (EMSGSIZE) is posted on the TCP +socket. The BPF program may also enforce a maximum messages size and report an +error when it is exceeded. + +A timeout may be set for assembling messages on a receive socket. The timeout +value is taken from the receive timeout of the attached TCP socket (this is set +by SO_RCVTIMEO). If the timer expires before assembly is complete an error +(ETIMEDOUT) is posted on the socket. + +User interface +============== + +Creating a multiplexor +---------------------- + +A new multiplexor and initial KCM socket is created by a socket call: + + socket(AF_KCM, type, protocol) + + - type is either SOCK_DGRAM or SOCK_SEQPACKET + - protocol is KCMPROTO_CONNECTED + +Cloning KCM sockets +------------------- + +After the first KCM socket is created using the socket call as described +above, additional sockets for the multiplexor can be created by cloning +a KCM socket. 
This is accomplished by an ioctl on a KCM socket: + + /* From linux/kcm.h */ + struct kcm_clone { + int fd; + }; + + struct kcm_clone info; + + memset(&info, 0, sizeof(info)); + + err = ioctl(kcmfd, SIOCKCMCLONE, &info); + + if (!err) + newkcmfd = info.fd; + +Attach transport sockets +------------------------ + +Attaching of transport sockets to a multiplexor is performed by calling an +ioctl on a KCM socket for the multiplexor. e.g.: + + /* From linux/kcm.h */ + struct kcm_attach { + int fd; + int bpf_fd; + }; + + struct kcm_attach info; + + memset(&info, 0, sizeof(info)); + + info.fd = tcpfd; + info.bpf_fd = bpf_prog_fd; + + ioctl(kcmfd, SIOCKCMATTACH, &info); + +The kcm_attach structure contains: + fd: file descriptor for TCP socket being attached + bpf_prog_fd: file descriptor for compiled BPF program downloaded + +Unattach transport sockets +-------------------------- + +Unattaching a transport socket from a multiplexor is straightforward. An +"unattach" ioctl is done with the kcm_unattach structure as the argument: + + /* From linux/kcm.h */ + struct kcm_unattach { + int fd; + }; + + struct kcm_unattach info; + + memset(&info, 0, sizeof(info)); + + info.fd = cfd; + + ioctl(fd, SIOCKCMUNATTACH, &info); + +Disabling receive on KCM socket +------------------------------- + +A setsockopt is used to disable or enable receiving on a KCM socket. +When receive is disabled, any pending messages in the socket's +receive buffer are moved to other sockets. This feature is useful +if an application thread knows that it will be doing a lot of +work on a request and won't be able to service new messages for a +while. Example use: + + int val = 1; + + setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val)) + +BFP programs for message delineation +------------------------------------ + +BPF programs can be compiled using the BPF LLVM backend. 
For exmple, +the BPF program for parsing Thrift is: + + #include "bpf.h" /* for __sk_buff */ + #include "bpf_helpers.h" /* for load_word intrinsic */ + + SEC("socket_kcm") + int bpf_prog1(struct __sk_buff *skb) + { + return load_word(skb, 0) + 4; + } + + char _license[] SEC("license") = "GPL"; + +Use in applications +=================== + +KCM accelerates application layer protocols. Specifically, it allows +applications to use a message based interface for sending and receiving +messages. The kernel provides necessary assurances that messages are sent +and received atomically. This relieves much of the burden applications have +in mapping a message based protocol onto the TCP stream. KCM also make +application layer messages a unit of work in the kernel for the purposes of +steerng and scheduling, which in turn allows a simpler networking model in +multithreaded applications. + +Configurations +-------------- + +In an Nx1 configuration, KCM logically provides multiple socket handles +to the same TCP connection. This allows parallelism between in I/O +operations on the TCP socket (for instance copyin and copyout of data is +parallelized). In an application, a KCM socket can be opened for each +processing thread and inserted into the epoll (similar to how SO_REUSEPORT +is used to allow multiple listener sockets on the same port). + +In a MxN configuration, multiple connections are established to the +same destination. These are used for simple load balancing. + +Message batching +---------------- + +The primary purpose of KCM is load balancing between KCM sockets and hence +threads in a nominal use case. Perfect load balancing, that is steering +each received message to a different KCM socket or steering each sent +message to a different TCP socket, can negatively impact performance +since this doesn't allow for affinities to be established. Balancing +based on groups, or batches of messages, can be beneficial for performance. 
+ +On transmit, there are three ways an application can batch (pipeline) +messages on a KCM socket. + 1) Send multiple messages in a single sendmmsg. + 2) Send a group of messages each with a sendmsg call, where all messages + except the last have MSG_BATCH in the flags of sendmsg call. + 3) Create "super message" composed of multiple messages and send this + with a single sendmsg. + +On receive, the KCM module attempts to queue messages received on the +same KCM socket during each TCP ready callback. The targeted KCM socket +changes at each receive ready callback on the KCM socket. The application +does not need to configure this. + +Error handling +-------------- + +An application should include a thread to monitor errors raised on +the TCP connection. Normally, this will be done by placing each +TCP socket attached to a KCM multiplexor in epoll set for POLLERR +event. If an error occurs on an attached TCP socket, KCM sets an EPIPE +on the socket thus waking up the application thread. When the application +sees the error (which may just be a disconnect) it should unattach the +socket from KCM and then close it. It is assumed that once an error is +posted on the TCP socket the data stream is unrecoverable (i.e. an error +may have occurred in in the middle of receiving a messssge). + +TCP connection monitoring +------------------------- + +In KCM there is no means to correlate a message to the TCP socket that +was used to send or receive the message (except in the case there is +only one attached TCP socket). However, the application does retain +an open file descriptor to the socket so it will be able to get statistics +from the socket which can be used in detecting issues (such as high +retransmissions on the socket). 
diff --git a/include/linux/net.h b/include/linux/net.h index 0b4ac7da583a..49175e4ced11 100644 --- a/include/linux/net.h +++ b/include/linux/net.h @@ -215,6 +215,7 @@ int __sock_create(struct net *net, int family, int type, int proto, int sock_create(int family, int type, int proto, struct socket **res); int sock_create_kern(struct net *net, int family, int type, int proto, struct socket **res); int sock_create_lite(int family, int type, int proto, struct socket **res); +struct socket *sock_alloc(void); void sock_release(struct socket *sock); int sock_sendmsg(struct socket *sock, struct msghdr *msg); int sock_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, diff --git a/include/linux/rculist.h b/include/linux/rculist.h index 14ec1652daf4..17d4f849c65e 100644 --- a/include/linux/rculist.h +++ b/include/linux/rculist.h @@ -319,6 +319,27 @@ static inline void list_splice_tail_init_rcu(struct list_head *list, }) /** + * list_next_or_null_rcu - get the first element from a list + * @head: the head for the list. + * @ptr: the list head to take the next element from. + * @type: the type of the struct this is embedded in. + * @member: the name of the list_head within the struct. + * + * Note that if the ptr is at the end of the list, NULL is returned. + * + * This primitive may safely run concurrently with the _rcu list-mutation + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock(). + */ +#define list_next_or_null_rcu(head, ptr, type, member) \ +({ \ + struct list_head *__head = (head); \ + struct list_head *__ptr = (ptr); \ + struct list_head *__next = READ_ONCE(__ptr->next); \ + likely(__next != __head) ? list_entry_rcu(__next, type, \ + member) : NULL; \ +}) + +/** * list_for_each_entry_rcu - iterate over rcu list of given type * @pos: the type * to use as a loop cursor. * @head: the head for your list. 
diff --git a/include/linux/socket.h b/include/linux/socket.h index 5bf59c8493b7..73bf6c6a833b 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -200,7 +200,9 @@ struct ucred { #define AF_ALG 38 /* Algorithm sockets */ #define AF_NFC 39 /* NFC sockets */ #define AF_VSOCK 40 /* vSockets */ -#define AF_MAX 41 /* For now.. */ +#define AF_KCM 41 /* Kernel Connection Multiplexor*/ + +#define AF_MAX 42 /* For now.. */ /* Protocol families, same as address families. */ #define PF_UNSPEC AF_UNSPEC @@ -246,6 +248,7 @@ struct ucred { #define PF_ALG AF_ALG #define PF_NFC AF_NFC #define PF_VSOCK AF_VSOCK +#define PF_KCM AF_KCM #define PF_MAX AF_MAX /* Maximum queue length specifiable by listen. */ @@ -274,6 +277,7 @@ struct ucred { #define MSG_MORE 0x8000 /* Sender will send more */ #define MSG_WAITFORONE 0x10000 /* recvmmsg(): block until 1+ packets avail */ #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */ +#define MSG_BATCH 0x40000 /* sendmmsg(): more messages coming */ #define MSG_EOF MSG_FIN #define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */ @@ -322,6 +326,7 @@ struct ucred { #define SOL_CAIF 278 #define SOL_ALG 279 #define SOL_NFC 280 +#define SOL_KCM 281 /* IPX options */ #define IPX_TYPE 1 diff --git a/include/net/kcm.h b/include/net/kcm.h new file mode 100644 index 000000000000..95c425ca97b6 --- /dev/null +++ b/include/net/kcm.h @@ -0,0 +1,226 @@ +/* + * Kernel Connection Multiplexor + * + * Copyright (c) 2016 Tom Herbert <tom@herbertland.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. 
+ */ + +#ifndef __NET_KCM_H_ +#define __NET_KCM_H_ + +#include <linux/skbuff.h> +#include <net/sock.h> +#include <uapi/linux/kcm.h> + +extern unsigned int kcm_net_id; + +#define KCM_STATS_ADD(stat, count) ((stat) += (count)) +#define KCM_STATS_INCR(stat) ((stat)++) + +struct kcm_psock_stats { + unsigned long long rx_msgs; + unsigned long long rx_bytes; + unsigned long long tx_msgs; + unsigned long long tx_bytes; + unsigned int rx_aborts; + unsigned int rx_mem_fail; + unsigned int rx_need_more_hdr; + unsigned int rx_msg_too_big; + unsigned int rx_msg_timeouts; + unsigned int rx_bad_hdr_len; + unsigned long long reserved; + unsigned long long unreserved; + unsigned int tx_aborts; +}; + +struct kcm_mux_stats { + unsigned long long rx_msgs; + unsigned long long rx_bytes; + unsigned long long tx_msgs; + unsigned long long tx_bytes; + unsigned int rx_ready_drops; + unsigned int tx_retries; + unsigned int psock_attach; + unsigned int psock_unattach_rsvd; + unsigned int psock_unattach; +}; + +struct kcm_stats { + unsigned long long rx_msgs; + unsigned long long rx_bytes; + unsigned long long tx_msgs; + unsigned long long tx_bytes; +}; + +struct kcm_tx_msg { + unsigned int sent; + unsigned int fragidx; + unsigned int frag_offset; + unsigned int msg_flags; + struct sk_buff *frag_skb; + struct sk_buff *last_skb; +}; + +struct kcm_rx_msg { + int full_len; + int accum_len; + int offset; + int early_eaten; +}; + +/* Socket structure for KCM client sockets */ +struct kcm_sock { + struct sock sk; + struct kcm_mux *mux; + struct list_head kcm_sock_list; + int index; + u32 done : 1; + struct work_struct done_work; + + struct kcm_stats stats; + + /* Transmit */ + struct kcm_psock *tx_psock; + struct work_struct tx_work; + struct list_head wait_psock_list; + struct sk_buff *seq_skb; + + /* Don't use bit fields here, these are set under different locks */ + bool tx_wait; + bool tx_wait_more; + + /* Receive */ + struct kcm_psock *rx_psock; + struct list_head wait_rx_list; /* KCMs 
waiting for receiving */ + bool rx_wait; + u32 rx_disabled : 1; +}; + +struct bpf_prog; + +/* Structure for an attached lower socket */ +struct kcm_psock { + struct sock *sk; + struct kcm_mux *mux; + int index; + + u32 tx_stopped : 1; + u32 rx_stopped : 1; + u32 done : 1; + u32 unattaching : 1; + + void (*save_state_change)(struct sock *sk); + void (*save_data_ready)(struct sock *sk); + void (*save_write_space)(struct sock *sk); + + struct list_head psock_list; + + struct kcm_psock_stats stats; + + /* Receive */ + struct sk_buff *rx_skb_head; + struct sk_buff **rx_skb_nextp; + struct sk_buff *ready_rx_msg; + struct list_head psock_ready_list; + struct work_struct rx_work; + struct delayed_work rx_delayed_work; + struct bpf_prog *bpf_prog; + struct kcm_sock *rx_kcm; + unsigned long long saved_rx_bytes; + unsigned long long saved_rx_msgs; + struct timer_list rx_msg_timer; + unsigned int rx_need_bytes; + + /* Transmit */ + struct kcm_sock *tx_kcm; + struct list_head psock_avail_list; + unsigned long long saved_tx_bytes; + unsigned long long saved_tx_msgs; +}; + +/* Per net MUX list */ +struct kcm_net { + struct mutex mutex; + struct kcm_psock_stats aggregate_psock_stats; + struct kcm_mux_stats aggregate_mux_stats; + struct list_head mux_list; + int count; +}; + +/* Structure for a MUX */ +struct kcm_mux { + struct list_head kcm_mux_list; + struct rcu_head rcu; + struct kcm_net *knet; + + struct list_head kcm_socks; /* All KCM sockets on MUX */ + int kcm_socks_cnt; /* Total KCM socket count for MUX */ + struct list_head psocks; /* List of all psocks on MUX */ + int psocks_cnt; /* Total attached sockets */ + + struct kcm_mux_stats stats; + struct kcm_psock_stats aggregate_psock_stats; + + /* Receive */ + spinlock_t rx_lock ____cacheline_aligned_in_smp; + struct list_head kcm_rx_waiters; /* KCMs waiting for receiving */ + struct list_head psocks_ready; /* List of psocks with a msg ready */ + struct sk_buff_head rx_hold_queue; + + /* Transmit */ + spinlock_t lock 
____cacheline_aligned_in_smp; /* TX and mux locking */ + struct list_head psocks_avail; /* List of available psocks */ + struct list_head kcm_tx_waiters; /* KCMs waiting for a TX psock */ +}; + +#ifdef CONFIG_PROC_FS +int kcm_proc_init(void); +void kcm_proc_exit(void); +#else +static int kcm_proc_init(void) { return 0; } +static void kcm_proc_exit(void) { } +#endif + +static inline void aggregate_psock_stats(struct kcm_psock_stats *stats, + struct kcm_psock_stats *agg_stats) +{ + /* Save psock statistics in the mux when psock is being unattached. */ + +#define SAVE_PSOCK_STATS(_stat) (agg_stats->_stat += stats->_stat) + SAVE_PSOCK_STATS(rx_msgs); + SAVE_PSOCK_STATS(rx_bytes); + SAVE_PSOCK_STATS(rx_aborts); + SAVE_PSOCK_STATS(rx_mem_fail); + SAVE_PSOCK_STATS(rx_need_more_hdr); + SAVE_PSOCK_STATS(rx_msg_too_big); + SAVE_PSOCK_STATS(rx_msg_timeouts); + SAVE_PSOCK_STATS(rx_bad_hdr_len); + SAVE_PSOCK_STATS(tx_msgs); + SAVE_PSOCK_STATS(tx_bytes); + SAVE_PSOCK_STATS(reserved); + SAVE_PSOCK_STATS(unreserved); + SAVE_PSOCK_STATS(tx_aborts); +#undef SAVE_PSOCK_STATS +} + +static inline void aggregate_mux_stats(struct kcm_mux_stats *stats, + struct kcm_mux_stats *agg_stats) +{ + /* Save psock statistics in the mux when psock is being unattached. 
*/ + +#define SAVE_MUX_STATS(_stat) (agg_stats->_stat += stats->_stat) + SAVE_MUX_STATS(rx_msgs); + SAVE_MUX_STATS(rx_bytes); + SAVE_MUX_STATS(tx_msgs); + SAVE_MUX_STATS(tx_bytes); + SAVE_MUX_STATS(rx_ready_drops); + SAVE_MUX_STATS(psock_attach); + SAVE_MUX_STATS(psock_unattach_rsvd); + SAVE_MUX_STATS(psock_unattach); +#undef SAVE_MUX_STATS +} + +#endif /* __NET_KCM_H_ */ diff --git a/include/net/tcp.h b/include/net/tcp.h index e90db8546806..0302636af98c 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1816,4 +1816,28 @@ static inline void skb_set_tcp_pure_ack(struct sk_buff *skb) skb->truesize = 2; } +static inline int tcp_inq(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + int answ; + + if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) { + answ = 0; + } else if (sock_flag(sk, SOCK_URGINLINE) || + !tp->urg_data || + before(tp->urg_seq, tp->copied_seq) || + !before(tp->urg_seq, tp->rcv_nxt)) { + + answ = tp->rcv_nxt - tp->copied_seq; + + /* Subtract 1, if FIN was received */ + if (answ && sock_flag(sk, SOCK_DONE)) + answ--; + } else { + answ = tp->urg_seq - tp->copied_seq; + } + + return answ; +} + #endif /* _TCP_H */ diff --git a/include/uapi/linux/kcm.h b/include/uapi/linux/kcm.h new file mode 100644 index 000000000000..a5a530940b99 --- /dev/null +++ b/include/uapi/linux/kcm.h @@ -0,0 +1,40 @@ +/* + * Kernel Connection Multiplexor + * + * Copyright (c) 2016 Tom Herbert <tom@herbertland.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * User API to clone KCM sockets and attach transport socket to a KCM + * multiplexor. 
+ */ + +#ifndef KCM_KERNEL_H +#define KCM_KERNEL_H + +struct kcm_attach { + int fd; + int bpf_fd; +}; + +struct kcm_unattach { + int fd; +}; + +struct kcm_clone { + int fd; +}; + +#define SIOCKCMATTACH (SIOCPROTOPRIVATE + 0) +#define SIOCKCMUNATTACH (SIOCPROTOPRIVATE + 1) +#define SIOCKCMCLONE (SIOCPROTOPRIVATE + 2) + +#define KCMPROTO_CONNECTED 0 + +/* Socket options */ +#define KCM_RECV_DISABLE 1 + +#endif + diff --git a/net/Kconfig b/net/Kconfig index 2760825e53fa..10640d5f8bee 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -360,6 +360,7 @@ source "net/can/Kconfig" source "net/irda/Kconfig" source "net/bluetooth/Kconfig" source "net/rxrpc/Kconfig" +source "net/kcm/Kconfig" config FIB_RULES bool diff --git a/net/Makefile b/net/Makefile index a5d04098dfce..81d14119eab5 100644 --- a/net/Makefile +++ b/net/Makefile @@ -34,6 +34,7 @@ obj-$(CONFIG_IRDA) += irda/ obj-$(CONFIG_BT) += bluetooth/ obj-$(CONFIG_SUNRPC) += sunrpc/ obj-$(CONFIG_AF_RXRPC) += rxrpc/ +obj-$(CONFIG_AF_KCM) += kcm/ obj-$(CONFIG_ATM) += atm/ obj-$(CONFIG_L2TP) += l2tp/ obj-$(CONFIG_DECNET) += decnet/ diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 9d7be61e5e6b..51d768e7bc90 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1918,6 +1918,7 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe, struct splice_pipe_desc *spd, struct sock *sk) { int seg; + struct sk_buff *iter; /* map the linear part : * If skb->head_frag is set, this 'linear' part is backed by a @@ -1944,6 +1945,19 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe, return true; } + skb_walk_frags(skb, iter) { + if (*offset >= iter->len) { + *offset -= iter->len; + continue; + } + /* __skb_splice_bits() only fails if the output has no room + * left, so no point in going over the frag_list for the error + * case. 
+ */ + if (__skb_splice_bits(iter, pipe, offset, len, spd, sk)) + return true; + } + return false; } @@ -1970,9 +1984,7 @@ ssize_t skb_socket_splice(struct sock *sk, /* * Map data from the skb to a pipe. Should handle both the linear part, - * the fragments, and the frag list. It does NOT handle frag lists within - * the frag list, if such a thing exists. We'd probably need to recurse to - * handle that cleanly. + * the fragments, and the frag list. */ int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset, struct pipe_inode_info *pipe, unsigned int tlen, @@ -1991,29 +2003,10 @@ int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset, .ops = &nosteal_pipe_buf_ops, .spd_release = sock_spd_release, }; - struct sk_buff *frag_iter; int ret = 0; - /* - * __skb_splice_bits() only fails if the output has no room left, - * so no point in going over the frag_list for the error case. - */ - if (__skb_splice_bits(skb, pipe, &offset, &tlen, &spd, sk)) - goto done; - else if (!tlen) - goto done; + __skb_splice_bits(skb, pipe, &offset, &tlen, &spd, sk); - /* - * now see if we have a frag_list to map - */ - skb_walk_frags(skb, frag_iter) { - if (!tlen) - break; - if (__skb_splice_bits(frag_iter, pipe, &offset, &tlen, &spd, sk)) - break; - } - -done: if (spd.nr_pages) ret = splice_cb(sk, pipe, &spd); diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index f9faadb42485..a265f00b9df9 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -556,20 +556,7 @@ int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg) return -EINVAL; slow = lock_sock_fast(sk); - if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) - answ = 0; - else if (sock_flag(sk, SOCK_URGINLINE) || - !tp->urg_data || - before(tp->urg_seq, tp->copied_seq) || - !before(tp->urg_seq, tp->rcv_nxt)) { - - answ = tp->rcv_nxt - tp->copied_seq; - - /* Subtract 1, if FIN was received */ - if (answ && sock_flag(sk, SOCK_DONE)) - answ--; - } else - answ = tp->urg_seq - tp->copied_seq; + 
answ = tcp_inq(sk); unlock_sock_fast(sk, slow); break; case SIOCATMARK: diff --git a/net/kcm/Kconfig b/net/kcm/Kconfig new file mode 100644 index 000000000000..5db94d940ecc --- /dev/null +++ b/net/kcm/Kconfig @@ -0,0 +1,10 @@ + +config AF_KCM + tristate "KCM sockets" + depends on INET + select BPF_SYSCALL + ---help--- + KCM (Kernel Connection Multiplexor) sockets provide a method + for multiplexing messages of a message based application + protocol over kernel connectons (e.g. TCP connections). + diff --git a/net/kcm/Makefile b/net/kcm/Makefile new file mode 100644 index 000000000000..71256133e677 --- /dev/null +++ b/net/kcm/Makefile @@ -0,0 +1,3 @@ +obj-$(CONFIG_AF_KCM) += kcm.o + +kcm-y := kcmsock.o kcmproc.o diff --git a/net/kcm/kcmproc.c b/net/kcm/kcmproc.c new file mode 100644 index 000000000000..738008726cc6 --- /dev/null +++ b/net/kcm/kcmproc.c @@ -0,0 +1,426 @@ +#include <linux/in.h> +#include <linux/inet.h> +#include <linux/list.h> +#include <linux/module.h> +#include <linux/net.h> +#include <linux/proc_fs.h> +#include <linux/rculist.h> +#include <linux/seq_file.h> +#include <linux/socket.h> +#include <net/inet_sock.h> +#include <net/kcm.h> +#include <net/net_namespace.h> +#include <net/netns/generic.h> +#include <net/tcp.h> + +#ifdef CONFIG_PROC_FS +struct kcm_seq_muxinfo { + char *name; + const struct file_operations *seq_fops; + const struct seq_operations seq_ops; +}; + +static struct kcm_mux *kcm_get_first(struct seq_file *seq) +{ + struct net *net = seq_file_net(seq); + struct kcm_net *knet = net_generic(net, kcm_net_id); + + return list_first_or_null_rcu(&knet->mux_list, + struct kcm_mux, kcm_mux_list); +} + +static struct kcm_mux *kcm_get_next(struct kcm_mux *mux) +{ + struct kcm_net *knet = mux->knet; + + return list_next_or_null_rcu(&knet->mux_list, &mux->kcm_mux_list, + struct kcm_mux, kcm_mux_list); +} + +static struct kcm_mux *kcm_get_idx(struct seq_file *seq, loff_t pos) +{ + struct net *net = seq_file_net(seq); + struct kcm_net *knet = 
net_generic(net, kcm_net_id); + struct kcm_mux *m; + + list_for_each_entry_rcu(m, &knet->mux_list, kcm_mux_list) { + if (!pos) + return m; + --pos; + } + return NULL; +} + +static void *kcm_seq_next(struct seq_file *seq, void *v, loff_t *pos) +{ + void *p; + + if (v == SEQ_START_TOKEN) + p = kcm_get_first(seq); + else + p = kcm_get_next(v); + ++*pos; + return p; +} + +static void *kcm_seq_start(struct seq_file *seq, loff_t *pos) + __acquires(rcu) +{ + rcu_read_lock(); + + if (!*pos) + return SEQ_START_TOKEN; + else + return kcm_get_idx(seq, *pos - 1); +} + +static void kcm_seq_stop(struct seq_file *seq, void *v) + __releases(rcu) +{ + rcu_read_unlock(); +} + +struct kcm_proc_mux_state { + struct seq_net_private p; + int idx; +}; + +static int kcm_seq_open(struct inode *inode, struct file *file) +{ + struct kcm_seq_muxinfo *muxinfo = PDE_DATA(inode); + int err; + + err = seq_open_net(inode, file, &muxinfo->seq_ops, + sizeof(struct kcm_proc_mux_state)); + if (err < 0) + return err; + return err; +} + +static void kcm_format_mux_header(struct seq_file *seq) +{ + struct net *net = seq_file_net(seq); + struct kcm_net *knet = net_generic(net, kcm_net_id); + + seq_printf(seq, + "*** KCM statistics (%d MUX) ****\n", + knet->count); + + seq_printf(seq, + "%-14s %-10s %-16s %-10s %-16s %-8s %-8s %-8s %-8s %s", + "Object", + "RX-Msgs", + "RX-Bytes", + "TX-Msgs", + "TX-Bytes", + "Recv-Q", + "Rmem", + "Send-Q", + "Smem", + "Status"); + + /* XXX: pdsts header stuff here */ + seq_puts(seq, "\n"); +} + +static void kcm_format_sock(struct kcm_sock *kcm, struct seq_file *seq, + int i, int *len) +{ + seq_printf(seq, + " kcm-%-7u %-10llu %-16llu %-10llu %-16llu %-8d %-8d %-8d %-8s ", + kcm->index, + kcm->stats.rx_msgs, + kcm->stats.rx_bytes, + kcm->stats.tx_msgs, + kcm->stats.tx_bytes, + kcm->sk.sk_receive_queue.qlen, + sk_rmem_alloc_get(&kcm->sk), + kcm->sk.sk_write_queue.qlen, + "-"); + + if (kcm->tx_psock) + seq_printf(seq, "Psck-%u ", kcm->tx_psock->index); + + if (kcm->tx_wait) + 
seq_puts(seq, "TxWait "); + + if (kcm->tx_wait_more) + seq_puts(seq, "WMore "); + + if (kcm->rx_wait) + seq_puts(seq, "RxWait "); + + seq_puts(seq, "\n"); +} + +static void kcm_format_psock(struct kcm_psock *psock, struct seq_file *seq, + int i, int *len) +{ + seq_printf(seq, + " psock-%-5u %-10llu %-16llu %-10llu %-16llu %-8d %-8d %-8d %-8d ", + psock->index, + psock->stats.rx_msgs, + psock->stats.rx_bytes, + psock->stats.tx_msgs, + psock->stats.tx_bytes, + psock->sk->sk_receive_queue.qlen, + atomic_read(&psock->sk->sk_rmem_alloc), + psock->sk->sk_write_queue.qlen, + atomic_read(&psock->sk->sk_wmem_alloc)); + + if (psock->done) + seq_puts(seq, "Done "); + + if (psock->tx_stopped) + seq_puts(seq, "TxStop "); + + if (psock->rx_stopped) + seq_puts(seq, "RxStop "); + + if (psock->tx_kcm) + seq_printf(seq, "Rsvd-%d ", psock->tx_kcm->index); + + if (psock->ready_rx_msg) + seq_puts(seq, "RdyRx "); + + seq_puts(seq, "\n"); +} + +static void +kcm_format_mux(struct kcm_mux *mux, loff_t idx, struct seq_file *seq) +{ + int i, len; + struct kcm_sock *kcm; + struct kcm_psock *psock; + + /* mux information */ + seq_printf(seq, + "%-6s%-8s %-10llu %-16llu %-10llu %-16llu %-8s %-8s %-8s %-8s ", + "mux", "", + mux->stats.rx_msgs, + mux->stats.rx_bytes, + mux->stats.tx_msgs, + mux->stats.tx_bytes, + "-", "-", "-", "-"); + + seq_printf(seq, "KCMs: %d, Psocks %d\n", + mux->kcm_socks_cnt, mux->psocks_cnt); + + /* kcm sock information */ + i = 0; + spin_lock_bh(&mux->lock); + list_for_each_entry(kcm, &mux->kcm_socks, kcm_sock_list) { + kcm_format_sock(kcm, seq, i, &len); + i++; + } + i = 0; + list_for_each_entry(psock, &mux->psocks, psock_list) { + kcm_format_psock(psock, seq, i, &len); + i++; + } + spin_unlock_bh(&mux->lock); +} + +static int kcm_seq_show(struct seq_file *seq, void *v) +{ + struct kcm_proc_mux_state *mux_state; + + mux_state = seq->private; + if (v == SEQ_START_TOKEN) { + mux_state->idx = 0; + kcm_format_mux_header(seq); + } else { + kcm_format_mux(v, mux_state->idx, 
seq); + mux_state->idx++; + } + return 0; +} + +static const struct file_operations kcm_seq_fops = { + .owner = THIS_MODULE, + .open = kcm_seq_open, + .read = seq_read, + .llseek = seq_lseek, +}; + +static struct kcm_seq_muxinfo kcm_seq_muxinfo = { + .name = "kcm", + .seq_fops = &kcm_seq_fops, + .seq_ops = { + .show = kcm_seq_show, + .start = kcm_seq_start, + .next = kcm_seq_next, + .stop = kcm_seq_stop, + } +}; + +static int kcm_proc_register(struct net *net, struct kcm_seq_muxinfo *muxinfo) +{ + struct proc_dir_entry *p; + int rc = 0; + + p = proc_create_data(muxinfo->name, S_IRUGO, net->proc_net, + muxinfo->seq_fops, muxinfo); + if (!p) + rc = -ENOMEM; + return rc; +} +EXPORT_SYMBOL(kcm_proc_register); + +static void kcm_proc_unregister(struct net *net, + struct kcm_seq_muxinfo *muxinfo) +{ + remove_proc_entry(muxinfo->name, net->proc_net); +} +EXPORT_SYMBOL(kcm_proc_unregister); + +static int kcm_stats_seq_show(struct seq_file *seq, void *v) +{ + struct kcm_psock_stats psock_stats; + struct kcm_mux_stats mux_stats; + struct kcm_mux *mux; + struct kcm_psock *psock; + struct net *net = seq->private; + struct kcm_net *knet = net_generic(net, kcm_net_id); + + memset(&mux_stats, 0, sizeof(mux_stats)); + memset(&psock_stats, 0, sizeof(psock_stats)); + + mutex_lock(&knet->mutex); + + aggregate_mux_stats(&knet->aggregate_mux_stats, &mux_stats); + aggregate_psock_stats(&knet->aggregate_psock_stats, + &psock_stats); + + list_for_each_entry_rcu(mux, &knet->mux_list, kcm_mux_list) { + spin_lock_bh(&mux->lock); + aggregate_mux_stats(&mux->stats, &mux_stats); + aggregate_psock_stats(&mux->aggregate_psock_stats, + &psock_stats); + list_for_each_entry(psock, &mux->psocks, psock_list) + aggregate_psock_stats(&psock->stats, &psock_stats); + spin_unlock_bh(&mux->lock); + } + + mutex_unlock(&knet->mutex); + + seq_printf(seq, + "%-8s %-10s %-16s %-10s %-16s %-10s %-10s %-10s %-10s %-10s\n", + "MUX", + "RX-Msgs", + "RX-Bytes", + "TX-Msgs", + "TX-Bytes", + "TX-Retries", + "Attach", + 
"Unattach", + "UnattchRsvd", + "RX-RdyDrops"); + + seq_printf(seq, + "%-8s %-10llu %-16llu %-10llu %-16llu %-10u %-10u %-10u %-10u %-10u\n", + "", + mux_stats.rx_msgs, + mux_stats.rx_bytes, + mux_stats.tx_msgs, + mux_stats.tx_bytes, + mux_stats.tx_retries, + mux_stats.psock_attach, + mux_stats.psock_unattach_rsvd, + mux_stats.psock_unattach, + mux_stats.rx_ready_drops); + + seq_printf(seq, + "%-8s %-10s %-16s %-10s %-16s %-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s %-10s\n", + "Psock", + "RX-Msgs", + "RX-Bytes", + "TX-Msgs", + "TX-Bytes", + "Reserved", + "Unreserved", + "RX-Aborts", + "RX-MemFail", + "RX-NeedMor", + "RX-BadLen", + "RX-TooBig", + "RX-Timeout", + "TX-Aborts"); + + seq_printf(seq, + "%-8s %-10llu %-16llu %-10llu %-16llu %-10llu %-10llu %-10u %-10u %-10u %-10u %-10u %-10u %-10u\n", + "", + psock_stats.rx_msgs, + psock_stats.rx_bytes, + psock_stats.tx_msgs, + psock_stats.tx_bytes, + psock_stats.reserved, + psock_stats.unreserved, + psock_stats.rx_aborts, + psock_stats.rx_mem_fail, + psock_stats.rx_need_more_hdr, + psock_stats.rx_bad_hdr_len, + psock_stats.rx_msg_too_big, + psock_stats.rx_msg_timeouts, + psock_stats.tx_aborts); + + return 0; +} + +static int kcm_stats_seq_open(struct inode *inode, struct file *file) +{ + return single_open_net(inode, file, kcm_stats_seq_show); +} + +static const struct file_operations kcm_stats_seq_fops = { + .owner = THIS_MODULE, + .open = kcm_stats_seq_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release_net, +}; + +static int kcm_proc_init_net(struct net *net) +{ + int err; + + if (!proc_create("kcm_stats", S_IRUGO, net->proc_net, + &kcm_stats_seq_fops)) { + err = -ENOMEM; + goto out_kcm_stats; + } + + err = kcm_proc_register(net, &kcm_seq_muxinfo); + if (err) + goto out_kcm; + + return 0; + +out_kcm: + remove_proc_entry("kcm_stats", net->proc_net); +out_kcm_stats: + return err; +} + +static void kcm_proc_exit_net(struct net *net) +{ + kcm_proc_unregister(net, &kcm_seq_muxinfo); + 
remove_proc_entry("kcm_stats", net->proc_net); +} + +static struct pernet_operations kcm_net_ops = { + .init = kcm_proc_init_net, + .exit = kcm_proc_exit_net, +}; + +int __init kcm_proc_init(void) +{ + return register_pernet_subsys(&kcm_net_ops); +} + +void __exit kcm_proc_exit(void) +{ + unregister_pernet_subsys(&kcm_net_ops); +} + +#endif /* CONFIG_PROC_FS */ diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c new file mode 100644 index 000000000000..40662d73204f --- /dev/null +++ b/net/kcm/kcmsock.c @@ -0,0 +1,2409 @@ +#include <linux/bpf.h> +#include <linux/errno.h> +#include <linux/errqueue.h> +#include <linux/file.h> +#include <linux/in.h> +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/net.h> +#include <linux/netdevice.h> +#include <linux/poll.h> +#include <linux/rculist.h> +#include <linux/skbuff.h> +#include <linux/socket.h> +#include <linux/uaccess.h> +#include <linux/workqueue.h> +#include <net/kcm.h> +#include <net/netns/generic.h> +#include <net/sock.h> +#include <net/tcp.h> +#include <uapi/linux/kcm.h> + +unsigned int kcm_net_id; + +static struct kmem_cache *kcm_psockp __read_mostly; +static struct kmem_cache *kcm_muxp __read_mostly; +static struct workqueue_struct *kcm_wq; + +static inline struct kcm_sock *kcm_sk(const struct sock *sk) +{ + return (struct kcm_sock *)sk; +} + +static inline struct kcm_tx_msg *kcm_tx_msg(struct sk_buff *skb) +{ + return (struct kcm_tx_msg *)skb->cb; +} + +static inline struct kcm_rx_msg *kcm_rx_msg(struct sk_buff *skb) +{ + return (struct kcm_rx_msg *)((void *)skb->cb + + offsetof(struct qdisc_skb_cb, data)); +} + +static void report_csk_error(struct sock *csk, int err) +{ + csk->sk_err = EPIPE; + csk->sk_error_report(csk); +} + +/* Callback lock held */ +static void kcm_abort_rx_psock(struct kcm_psock *psock, int err, + struct sk_buff *skb) +{ + struct sock *csk = psock->sk; + + /* Unrecoverable error in receive */ + + del_timer(&psock->rx_msg_timer); + + if (psock->rx_stopped) + return; + + 
psock->rx_stopped = 1; + KCM_STATS_INCR(psock->stats.rx_aborts); + + /* Report an error on the lower socket */ + report_csk_error(csk, err); +} + +static void kcm_abort_tx_psock(struct kcm_psock *psock, int err, + bool wakeup_kcm) +{ + struct sock *csk = psock->sk; + struct kcm_mux *mux = psock->mux; + + /* Unrecoverable error in transmit */ + + spin_lock_bh(&mux->lock); + + if (psock->tx_stopped) { + spin_unlock_bh(&mux->lock); + return; + } + + psock->tx_stopped = 1; + KCM_STATS_INCR(psock->stats.tx_aborts); + + if (!psock->tx_kcm) { + /* Take off psocks_avail list */ + list_del(&psock->psock_avail_list); + } else if (wakeup_kcm) { + /* In this case psock is being aborted while outside of + * write_msgs and psock is reserved. Schedule tx_work + * to handle the failure there. Need to commit tx_stopped + * before queuing work. + */ + smp_mb(); + + queue_work(kcm_wq, &psock->tx_kcm->tx_work); + } + + spin_unlock_bh(&mux->lock); + + /* Report error on lower socket */ + report_csk_error(csk, err); +} + +/* RX mux lock held. */ +static void kcm_update_rx_mux_stats(struct kcm_mux *mux, + struct kcm_psock *psock) +{ + KCM_STATS_ADD(mux->stats.rx_bytes, + psock->stats.rx_bytes - psock->saved_rx_bytes); + mux->stats.rx_msgs += + psock->stats.rx_msgs - psock->saved_rx_msgs; + psock->saved_rx_msgs = psock->stats.rx_msgs; + psock->saved_rx_bytes = psock->stats.rx_bytes; +} + +static void kcm_update_tx_mux_stats(struct kcm_mux *mux, + struct kcm_psock *psock) +{ + KCM_STATS_ADD(mux->stats.tx_bytes, + psock->stats.tx_bytes - psock->saved_tx_bytes); + mux->stats.tx_msgs += + psock->stats.tx_msgs - psock->saved_tx_msgs; + psock->saved_tx_msgs = psock->stats.tx_msgs; + psock->saved_tx_bytes = psock->stats.tx_bytes; +} + +static int kcm_queue_rcv_skb(struct sock *sk, struct sk_buff *skb); + +/* KCM is ready to receive messages on its queue-- either the KCM is new or + * has become unblocked after being blocked on full socket buffer. Queue any + * pending ready messages on a psock. 
RX mux lock held. + */ +static void kcm_rcv_ready(struct kcm_sock *kcm) +{ + struct kcm_mux *mux = kcm->mux; + struct kcm_psock *psock; + struct sk_buff *skb; + + if (unlikely(kcm->rx_wait || kcm->rx_psock || kcm->rx_disabled)) + return; + + while (unlikely((skb = __skb_dequeue(&mux->rx_hold_queue)))) { + if (kcm_queue_rcv_skb(&kcm->sk, skb)) { + /* Assuming buffer limit has been reached */ + skb_queue_head(&mux->rx_hold_queue, skb); + WARN_ON(!sk_rmem_alloc_get(&kcm->sk)); + return; + } + } + + while (!list_empty(&mux->psocks_ready)) { + psock = list_first_entry(&mux->psocks_ready, struct kcm_psock, + psock_ready_list); + + if (kcm_queue_rcv_skb(&kcm->sk, psock->ready_rx_msg)) { + /* Assuming buffer limit has been reached */ + WARN_ON(!sk_rmem_alloc_get(&kcm->sk)); + return; + } + + /* Consumed the ready message on the psock. Schedule rx_work to + * get more messages. + */ + list_del(&psock->psock_ready_list); + psock->ready_rx_msg = NULL; + + /* Commit clearing of ready_rx_msg for queuing work */ + smp_mb(); + + queue_work(kcm_wq, &psock->rx_work); + } + + /* Buffer limit is okay now, add to ready list */ + list_add_tail(&kcm->wait_rx_list, + &kcm->mux->kcm_rx_waiters); + kcm->rx_wait = true; +} + +static void kcm_rfree(struct sk_buff *skb) +{ + struct sock *sk = skb->sk; + struct kcm_sock *kcm = kcm_sk(sk); + struct kcm_mux *mux = kcm->mux; + unsigned int len = skb->truesize; + + sk_mem_uncharge(sk, len); + atomic_sub(len, &sk->sk_rmem_alloc); + + /* For reading rx_wait and rx_psock without holding lock */ + smp_mb__after_atomic(); + + if (!kcm->rx_wait && !kcm->rx_psock && + sk_rmem_alloc_get(sk) < sk->sk_rcvlowat) { + spin_lock_bh(&mux->rx_lock); + kcm_rcv_ready(kcm); + spin_unlock_bh(&mux->rx_lock); + } +} + +static int kcm_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) +{ + struct sk_buff_head *list = &sk->sk_receive_queue; + + if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf) + return -ENOMEM; + + if (!sk_rmem_schedule(sk, skb, skb->truesize)) + 
return -ENOBUFS; + + skb->dev = NULL; + + skb_orphan(skb); + skb->sk = sk; + skb->destructor = kcm_rfree; + atomic_add(skb->truesize, &sk->sk_rmem_alloc); + sk_mem_charge(sk, skb->truesize); + + skb_queue_tail(list, skb); + + if (!sock_flag(sk, SOCK_DEAD)) + sk->sk_data_ready(sk); + + return 0; +} + +/* Requeue received messages for a kcm socket to other kcm sockets. This is + * called with a kcm socket is receive disabled. + * RX mux lock held. + */ +static void requeue_rx_msgs(struct kcm_mux *mux, struct sk_buff_head *head) +{ + struct sk_buff *skb; + struct kcm_sock *kcm; + + while ((skb = __skb_dequeue(head))) { + /* Reset destructor to avoid calling kcm_rcv_ready */ + skb->destructor = sock_rfree; + skb_orphan(skb); +try_again: + if (list_empty(&mux->kcm_rx_waiters)) { + skb_queue_tail(&mux->rx_hold_queue, skb); + continue; + } + + kcm = list_first_entry(&mux->kcm_rx_waiters, + struct kcm_sock, wait_rx_list); + + if (kcm_queue_rcv_skb(&kcm->sk, skb)) { + /* Should mean socket buffer full */ + list_del(&kcm->wait_rx_list); + kcm->rx_wait = false; + + /* Commit rx_wait to read in kcm_free */ + smp_wmb(); + + goto try_again; + } + } +} + +/* Lower sock lock held */ +static struct kcm_sock *reserve_rx_kcm(struct kcm_psock *psock, + struct sk_buff *head) +{ + struct kcm_mux *mux = psock->mux; + struct kcm_sock *kcm; + + WARN_ON(psock->ready_rx_msg); + + if (psock->rx_kcm) + return psock->rx_kcm; + + spin_lock_bh(&mux->rx_lock); + + if (psock->rx_kcm) { + spin_unlock_bh(&mux->rx_lock); + return psock->rx_kcm; + } + + kcm_update_rx_mux_stats(mux, psock); + + if (list_empty(&mux->kcm_rx_waiters)) { + psock->ready_rx_msg = head; + list_add_tail(&psock->psock_ready_list, + &mux->psocks_ready); + spin_unlock_bh(&mux->rx_lock); + return NULL; + } + + kcm = list_first_entry(&mux->kcm_rx_waiters, + struct kcm_sock, wait_rx_list); + list_del(&kcm->wait_rx_list); + kcm->rx_wait = false; + + psock->rx_kcm = kcm; + kcm->rx_psock = psock; + + spin_unlock_bh(&mux->rx_lock); + + 
return kcm; +} + +static void kcm_done(struct kcm_sock *kcm); + +static void kcm_done_work(struct work_struct *w) +{ + kcm_done(container_of(w, struct kcm_sock, done_work)); +} + +/* Lower sock held */ +static void unreserve_rx_kcm(struct kcm_psock *psock, + bool rcv_ready) +{ + struct kcm_sock *kcm = psock->rx_kcm; + struct kcm_mux *mux = psock->mux; + + if (!kcm) + return; + + spin_lock_bh(&mux->rx_lock); + + psock->rx_kcm = NULL; + kcm->rx_psock = NULL; + + /* Commit kcm->rx_psock before sk_rmem_alloc_get to sync with + * kcm_rfree + */ + smp_mb(); + + if (unlikely(kcm->done)) { + spin_unlock_bh(&mux->rx_lock); + + /* Need to run kcm_done in a task since we need to qcquire + * callback locks which may already be held here. + */ + INIT_WORK(&kcm->done_work, kcm_done_work); + schedule_work(&kcm->done_work); + return; + } + + if (unlikely(kcm->rx_disabled)) { + requeue_rx_msgs(mux, &kcm->sk.sk_receive_queue); + } else if (rcv_ready || unlikely(!sk_rmem_alloc_get(&kcm->sk))) { + /* Check for degenerative race with rx_wait that all + * data was dequeued (accounted for in kcm_rfree). + */ + kcm_rcv_ready(kcm); + } + spin_unlock_bh(&mux->rx_lock); +} + +static void kcm_start_rx_timer(struct kcm_psock *psock) +{ + if (psock->sk->sk_rcvtimeo) + mod_timer(&psock->rx_msg_timer, psock->sk->sk_rcvtimeo); +} + +/* Macro to invoke filter function. 
*/ +#define KCM_RUN_FILTER(prog, ctx) \ + (*prog->bpf_func)(ctx, prog->insnsi) + +/* Lower socket lock held */ +static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb, + unsigned int orig_offset, size_t orig_len) +{ + struct kcm_psock *psock = (struct kcm_psock *)desc->arg.data; + struct kcm_rx_msg *rxm; + struct kcm_sock *kcm; + struct sk_buff *head, *skb; + size_t eaten = 0, cand_len; + ssize_t extra; + int err; + bool cloned_orig = false; + + if (psock->ready_rx_msg) + return 0; + + head = psock->rx_skb_head; + if (head) { + /* Message already in progress */ + + rxm = kcm_rx_msg(head); + if (unlikely(rxm->early_eaten)) { + /* Already some number of bytes on the receive sock + * data saved in rx_skb_head, just indicate they + * are consumed. + */ + eaten = orig_len <= rxm->early_eaten ? + orig_len : rxm->early_eaten; + rxm->early_eaten -= eaten; + + return eaten; + } + + if (unlikely(orig_offset)) { + /* Getting data with a non-zero offset when a message is + * in progress is not expected. If it does happen, we + * need to clone and pull since we can't deal with + * offsets in the skbs for a message expect in the head. + */ + orig_skb = skb_clone(orig_skb, GFP_ATOMIC); + if (!orig_skb) { + KCM_STATS_INCR(psock->stats.rx_mem_fail); + desc->error = -ENOMEM; + return 0; + } + if (!pskb_pull(orig_skb, orig_offset)) { + KCM_STATS_INCR(psock->stats.rx_mem_fail); + kfree_skb(orig_skb); + desc->error = -ENOMEM; + return 0; + } + cloned_orig = true; + orig_offset = 0; + } + + if (!psock->rx_skb_nextp) { + /* We are going to append to the frags_list of head. + * Need to unshare the frag_list. + */ + err = skb_unclone(head, GFP_ATOMIC); + if (err) { + KCM_STATS_INCR(psock->stats.rx_mem_fail); + desc->error = err; + return 0; + } + + if (unlikely(skb_shinfo(head)->frag_list)) { + /* We can't append to an sk_buff that already + * has a frag_list. 
We create a new head, point + * the frag_list of that to the old head, and + * then are able to use the old head->next for + * appending to the message. + */ + if (WARN_ON(head->next)) { + desc->error = -EINVAL; + return 0; + } + + skb = alloc_skb(0, GFP_ATOMIC); + if (!skb) { + KCM_STATS_INCR(psock->stats.rx_mem_fail); + desc->error = -ENOMEM; + return 0; + } + skb->len = head->len; + skb->data_len = head->len; + skb->truesize = head->truesize; + *kcm_rx_msg(skb) = *kcm_rx_msg(head); + psock->rx_skb_nextp = &head->next; + skb_shinfo(skb)->frag_list = head; + psock->rx_skb_head = skb; + head = skb; + } else { + psock->rx_skb_nextp = + &skb_shinfo(head)->frag_list; + } + } + } + + while (eaten < orig_len) { + /* Always clone since we will consume something */ + skb = skb_clone(orig_skb, GFP_ATOMIC); + if (!skb) { + KCM_STATS_INCR(psock->stats.rx_mem_fail); + desc->error = -ENOMEM; + break; + } + + cand_len = orig_len - eaten; + + head = psock->rx_skb_head; + if (!head) { + head = skb; + psock->rx_skb_head = head; + /* Will set rx_skb_nextp on next packet if needed */ + psock->rx_skb_nextp = NULL; + rxm = kcm_rx_msg(head); + memset(rxm, 0, sizeof(*rxm)); + rxm->offset = orig_offset + eaten; + } else { + /* Unclone since we may be appending to an skb that we + * already share a frag_list with. 
+ */ + err = skb_unclone(skb, GFP_ATOMIC); + if (err) { + KCM_STATS_INCR(psock->stats.rx_mem_fail); + desc->error = err; + break; + } + + rxm = kcm_rx_msg(head); + *psock->rx_skb_nextp = skb; + psock->rx_skb_nextp = &skb->next; + head->data_len += skb->len; + head->len += skb->len; + head->truesize += skb->truesize; + } + + if (!rxm->full_len) { + ssize_t len; + + len = KCM_RUN_FILTER(psock->bpf_prog, head); + + if (!len) { + /* Need more header to determine length */ + if (!rxm->accum_len) { + /* Start RX timer for new message */ + kcm_start_rx_timer(psock); + } + rxm->accum_len += cand_len; + eaten += cand_len; + KCM_STATS_INCR(psock->stats.rx_need_more_hdr); + WARN_ON(eaten != orig_len); + break; + } else if (len > psock->sk->sk_rcvbuf) { + /* Message length exceeds maximum allowed */ + KCM_STATS_INCR(psock->stats.rx_msg_too_big); + desc->error = -EMSGSIZE; + psock->rx_skb_head = NULL; + kcm_abort_rx_psock(psock, EMSGSIZE, head); + break; + } else if (len <= (ssize_t)head->len - + skb->len - rxm->offset) { + /* Length must be into new skb (and also + * greater than zero) + */ + KCM_STATS_INCR(psock->stats.rx_bad_hdr_len); + desc->error = -EPROTO; + psock->rx_skb_head = NULL; + kcm_abort_rx_psock(psock, EPROTO, head); + break; + } + + rxm->full_len = len; + } + + extra = (ssize_t)(rxm->accum_len + cand_len) - rxm->full_len; + + if (extra < 0) { + /* Message not complete yet. */ + if (rxm->full_len - rxm->accum_len > + tcp_inq(psock->sk)) { + /* Don't have the whole messages in the socket + * buffer. Set psock->rx_need_bytes to wait for + * the rest of the message. Also, set "early + * eaten" since we've already buffered the skb + * but don't consume yet per tcp_read_sock. 
+ */ + + if (!rxm->accum_len) { + /* Start RX timer for new message */ + kcm_start_rx_timer(psock); + } + + psock->rx_need_bytes = rxm->full_len - + rxm->accum_len; + rxm->accum_len += cand_len; + rxm->early_eaten = cand_len; + KCM_STATS_ADD(psock->stats.rx_bytes, cand_len); + desc->count = 0; /* Stop reading socket */ + break; + } + rxm->accum_len += cand_len; + eaten += cand_len; + WARN_ON(eaten != orig_len); + break; + } + + /* Positive extra indicates ore bytes than needed for the + * message + */ + + WARN_ON(extra > cand_len); + + eaten += (cand_len - extra); + + /* Hurray, we have a new message! */ + del_timer(&psock->rx_msg_timer); + psock->rx_skb_head = NULL; + KCM_STATS_INCR(psock->stats.rx_msgs); + +try_queue: + kcm = reserve_rx_kcm(psock, head); + if (!kcm) { + /* Unable to reserve a KCM, message is held in psock. */ + break; + } + + if (kcm_queue_rcv_skb(&kcm->sk, head)) { + /* Should mean socket buffer full */ + unreserve_rx_kcm(psock, false); + goto try_queue; + } + } + + if (cloned_orig) + kfree_skb(orig_skb); + + KCM_STATS_ADD(psock->stats.rx_bytes, eaten); + + return eaten; +} + +/* Called with lock held on lower socket */ +static int psock_tcp_read_sock(struct kcm_psock *psock) +{ + read_descriptor_t desc; + + desc.arg.data = psock; + desc.error = 0; + desc.count = 1; /* give more than one skb per call */ + + /* sk should be locked here, so okay to do tcp_read_sock */ + tcp_read_sock(psock->sk, &desc, kcm_tcp_recv); + + unreserve_rx_kcm(psock, true); + + return desc.error; +} + +/* Lower sock lock held */ +static void psock_tcp_data_ready(struct sock *sk) +{ + struct kcm_psock *psock; + + read_lock_bh(&sk->sk_callback_lock); + + psock = (struct kcm_psock *)sk->sk_user_data; + if (unlikely(!psock || psock->rx_stopped)) + goto out; + + if (psock->ready_rx_msg) + goto out; + + if (psock->rx_need_bytes) { + if (tcp_inq(sk) >= psock->rx_need_bytes) + psock->rx_need_bytes = 0; + else + goto out; + } + + if (psock_tcp_read_sock(psock) == -ENOMEM) + 
queue_delayed_work(kcm_wq, &psock->rx_delayed_work, 0); + +out: + read_unlock_bh(&sk->sk_callback_lock); +} + +static void do_psock_rx_work(struct kcm_psock *psock) +{ + read_descriptor_t rd_desc; + struct sock *csk = psock->sk; + + /* We need the read lock to synchronize with psock_tcp_data_ready. We + * need the socket lock for calling tcp_read_sock. + */ + lock_sock(csk); + read_lock_bh(&csk->sk_callback_lock); + + if (unlikely(csk->sk_user_data != psock)) + goto out; + + if (unlikely(psock->rx_stopped)) + goto out; + + if (psock->ready_rx_msg) + goto out; + + rd_desc.arg.data = psock; + + if (psock_tcp_read_sock(psock) == -ENOMEM) + queue_delayed_work(kcm_wq, &psock->rx_delayed_work, 0); + +out: + read_unlock_bh(&csk->sk_callback_lock); + release_sock(csk); +} + +static void psock_rx_work(struct work_struct *w) +{ + do_psock_rx_work(container_of(w, struct kcm_psock, rx_work)); +} + +static void psock_rx_delayed_work(struct work_struct *w) +{ + do_psock_rx_work(container_of(w, struct kcm_psock, + rx_delayed_work.work)); +} + +static void psock_tcp_state_change(struct sock *sk) +{ + /* TCP only does a POLLIN for a half close. Do a POLLHUP here + * since application will normally not poll with POLLIN + * on the TCP sockets. + */ + + report_csk_error(sk, EPIPE); +} + +static void psock_tcp_write_space(struct sock *sk) +{ + struct kcm_psock *psock; + struct kcm_mux *mux; + struct kcm_sock *kcm; + + read_lock_bh(&sk->sk_callback_lock); + + psock = (struct kcm_psock *)sk->sk_user_data; + if (unlikely(!psock)) + goto out; + + mux = psock->mux; + + spin_lock_bh(&mux->lock); + + /* Check if the socket is reserved so someone is waiting for sending. */ + kcm = psock->tx_kcm; + if (kcm) + queue_work(kcm_wq, &kcm->tx_work); + + spin_unlock_bh(&mux->lock); +out: + read_unlock_bh(&sk->sk_callback_lock); +} + +static void unreserve_psock(struct kcm_sock *kcm); + +/* kcm sock is locked. 
*/ +static struct kcm_psock *reserve_psock(struct kcm_sock *kcm) +{ + struct kcm_mux *mux = kcm->mux; + struct kcm_psock *psock; + + psock = kcm->tx_psock; + + smp_rmb(); /* Must read tx_psock before tx_wait */ + + if (psock) { + WARN_ON(kcm->tx_wait); + if (unlikely(psock->tx_stopped)) + unreserve_psock(kcm); + else + return kcm->tx_psock; + } + + spin_lock_bh(&mux->lock); + + /* Check again under lock to see if psock was reserved for this + * psock via psock_unreserve. + */ + psock = kcm->tx_psock; + if (unlikely(psock)) { + WARN_ON(kcm->tx_wait); + spin_unlock_bh(&mux->lock); + return kcm->tx_psock; + } + + if (!list_empty(&mux->psocks_avail)) { + psock = list_first_entry(&mux->psocks_avail, + struct kcm_psock, + psock_avail_list); + list_del(&psock->psock_avail_list); + if (kcm->tx_wait) { + list_del(&kcm->wait_psock_list); + kcm->tx_wait = false; + } + kcm->tx_psock = psock; + psock->tx_kcm = kcm; + KCM_STATS_INCR(psock->stats.reserved); + } else if (!kcm->tx_wait) { + list_add_tail(&kcm->wait_psock_list, + &mux->kcm_tx_waiters); + kcm->tx_wait = true; + } + + spin_unlock_bh(&mux->lock); + + return psock; +} + +/* mux lock held */ +static void psock_now_avail(struct kcm_psock *psock) +{ + struct kcm_mux *mux = psock->mux; + struct kcm_sock *kcm; + + if (list_empty(&mux->kcm_tx_waiters)) { + list_add_tail(&psock->psock_avail_list, + &mux->psocks_avail); + } else { + kcm = list_first_entry(&mux->kcm_tx_waiters, + struct kcm_sock, + wait_psock_list); + list_del(&kcm->wait_psock_list); + kcm->tx_wait = false; + psock->tx_kcm = kcm; + + /* Commit before changing tx_psock since that is read in + * reserve_psock before queuing work. + */ + smp_mb(); + + kcm->tx_psock = psock; + KCM_STATS_INCR(psock->stats.reserved); + queue_work(kcm_wq, &kcm->tx_work); + } +} + +/* kcm sock is locked. 
*/ +static void unreserve_psock(struct kcm_sock *kcm) +{ + struct kcm_psock *psock; + struct kcm_mux *mux = kcm->mux; + + spin_lock_bh(&mux->lock); + + psock = kcm->tx_psock; + + if (WARN_ON(!psock)) { + spin_unlock_bh(&mux->lock); + return; + } + + smp_rmb(); /* Read tx_psock before tx_wait */ + + kcm_update_tx_mux_stats(mux, psock); + + WARN_ON(kcm->tx_wait); + + kcm->tx_psock = NULL; + psock->tx_kcm = NULL; + KCM_STATS_INCR(psock->stats.unreserved); + + if (unlikely(psock->tx_stopped)) { + if (psock->done) { + /* Deferred free */ + list_del(&psock->psock_list); + mux->psocks_cnt--; + sock_put(psock->sk); + fput(psock->sk->sk_socket->file); + kmem_cache_free(kcm_psockp, psock); + } + + /* Don't put back on available list */ + + spin_unlock_bh(&mux->lock); + + return; + } + + psock_now_avail(psock); + + spin_unlock_bh(&mux->lock); +} + +static void kcm_report_tx_retry(struct kcm_sock *kcm) +{ + struct kcm_mux *mux = kcm->mux; + + spin_lock_bh(&mux->lock); + KCM_STATS_INCR(mux->stats.tx_retries); + spin_unlock_bh(&mux->lock); +} + +/* Write any messages ready on the kcm socket. Called with kcm sock lock + * held. Return bytes actually sent or error. + */ +static int kcm_write_msgs(struct kcm_sock *kcm) +{ + struct sock *sk = &kcm->sk; + struct kcm_psock *psock; + struct sk_buff *skb, *head; + struct kcm_tx_msg *txm; + unsigned short fragidx, frag_offset; + unsigned int sent, total_sent = 0; + int ret = 0; + + kcm->tx_wait_more = false; + psock = kcm->tx_psock; + if (unlikely(psock && psock->tx_stopped)) { + /* A reserved psock was aborted asynchronously. Unreserve + * it and we'll retry the message. 
+ */ + unreserve_psock(kcm); + kcm_report_tx_retry(kcm); + if (skb_queue_empty(&sk->sk_write_queue)) + return 0; + + kcm_tx_msg(skb_peek(&sk->sk_write_queue))->sent = 0; + + } else if (skb_queue_empty(&sk->sk_write_queue)) { + return 0; + } + + head = skb_peek(&sk->sk_write_queue); + txm = kcm_tx_msg(head); + + if (txm->sent) { + /* Send of first skbuff in queue already in progress */ + if (WARN_ON(!psock)) { + ret = -EINVAL; + goto out; + } + sent = txm->sent; + frag_offset = txm->frag_offset; + fragidx = txm->fragidx; + skb = txm->frag_skb; + + goto do_frag; + } + +try_again: + psock = reserve_psock(kcm); + if (!psock) + goto out; + + do { + skb = head; + txm = kcm_tx_msg(head); + sent = 0; + +do_frag_list: + if (WARN_ON(!skb_shinfo(skb)->nr_frags)) { + ret = -EINVAL; + goto out; + } + + for (fragidx = 0; fragidx < skb_shinfo(skb)->nr_frags; + fragidx++) { + skb_frag_t *frag; + + frag_offset = 0; +do_frag: + frag = &skb_shinfo(skb)->frags[fragidx]; + if (WARN_ON(!frag->size)) { + ret = -EINVAL; + goto out; + } + + ret = kernel_sendpage(psock->sk->sk_socket, + frag->page.p, + frag->page_offset + frag_offset, + frag->size - frag_offset, + MSG_DONTWAIT); + if (ret <= 0) { + if (ret == -EAGAIN) { + /* Save state to try again when there's + * write space on the socket + */ + txm->sent = sent; + txm->frag_offset = frag_offset; + txm->fragidx = fragidx; + txm->frag_skb = skb; + + ret = 0; + goto out; + } + + /* Hard failure in sending message, abort this + * psock since it has lost framing + * synchronization and retry sending the + * message from the beginning. + */ + kcm_abort_tx_psock(psock, ret ? 
-ret : EPIPE, + true); + unreserve_psock(kcm); + + txm->sent = 0; + kcm_report_tx_retry(kcm); + ret = 0; + + goto try_again; + } + + sent += ret; + frag_offset += ret; + KCM_STATS_ADD(psock->stats.tx_bytes, ret); + if (frag_offset < frag->size) { + /* Not finished with this frag */ + goto do_frag; + } + } + + if (skb == head) { + if (skb_has_frag_list(skb)) { + skb = skb_shinfo(skb)->frag_list; + goto do_frag_list; + } + } else if (skb->next) { + skb = skb->next; + goto do_frag_list; + } + + /* Successfully sent the whole packet, account for it. */ + skb_dequeue(&sk->sk_write_queue); + kfree_skb(head); + sk->sk_wmem_queued -= sent; + total_sent += sent; + KCM_STATS_INCR(psock->stats.tx_msgs); + } while ((head = skb_peek(&sk->sk_write_queue))); +out: + if (!head) { + /* Done with all queued messages. */ + WARN_ON(!skb_queue_empty(&sk->sk_write_queue)); + unreserve_psock(kcm); + } + + /* Check if write space is available */ + sk->sk_write_space(sk); + + return total_sent ? : ret; +} + +static void kcm_tx_work(struct work_struct *w) +{ + struct kcm_sock *kcm = container_of(w, struct kcm_sock, tx_work); + struct sock *sk = &kcm->sk; + int err; + + lock_sock(sk); + + /* Primarily for SOCK_DGRAM sockets, also handle asynchronous tx + * aborts + */ + err = kcm_write_msgs(kcm); + if (err < 0) { + /* Hard failure in write, report error on KCM socket */ + pr_warn("KCM: Hard failure on kcm_write_msgs %d\n", err); + report_csk_error(&kcm->sk, -err); + goto out; + } + + /* Primarily for SOCK_SEQPACKET sockets */ + if (likely(sk->sk_socket) && + test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) { + clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags); + sk->sk_write_space(sk); + } + +out: + release_sock(sk); +} + +static void kcm_push(struct kcm_sock *kcm) +{ + if (kcm->tx_wait_more) + kcm_write_msgs(kcm); +} + +static ssize_t kcm_sendpage(struct socket *sock, struct page *page, + int offset, size_t size, int flags) + +{ + struct sock *sk = sock->sk; + struct kcm_sock *kcm = kcm_sk(sk); 
+ struct sk_buff *skb = NULL, *head = NULL; + long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT); + bool eor; + int err = 0; + int i; + + if (flags & MSG_SENDPAGE_NOTLAST) + flags |= MSG_MORE; + + /* No MSG_EOR from splice, only look at MSG_MORE */ + eor = !(flags & MSG_MORE); + + lock_sock(sk); + + sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk); + + err = -EPIPE; + if (sk->sk_err) + goto out_error; + + if (kcm->seq_skb) { + /* Previously opened message */ + head = kcm->seq_skb; + skb = kcm_tx_msg(head)->last_skb; + i = skb_shinfo(skb)->nr_frags; + + if (skb_can_coalesce(skb, i, page, offset)) { + skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], size); + skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG; + goto coalesced; + } + + if (i >= MAX_SKB_FRAGS) { + struct sk_buff *tskb; + + tskb = alloc_skb(0, sk->sk_allocation); + while (!tskb) { + kcm_push(kcm); + err = sk_stream_wait_memory(sk, &timeo); + if (err) + goto out_error; + } + + if (head == skb) + skb_shinfo(head)->frag_list = tskb; + else + skb->next = tskb; + + skb = tskb; + skb->ip_summed = CHECKSUM_UNNECESSARY; + i = 0; + } + } else { + /* Call the sk_stream functions to manage the sndbuf mem. 
*/ + if (!sk_stream_memory_free(sk)) { + kcm_push(kcm); + set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); + err = sk_stream_wait_memory(sk, &timeo); + if (err) + goto out_error; + } + + head = alloc_skb(0, sk->sk_allocation); + while (!head) { + kcm_push(kcm); + err = sk_stream_wait_memory(sk, &timeo); + if (err) + goto out_error; + } + + skb = head; + i = 0; + } + + get_page(page); + skb_fill_page_desc(skb, i, page, offset, size); + skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG; + +coalesced: + skb->len += size; + skb->data_len += size; + skb->truesize += size; + sk->sk_wmem_queued += size; + sk_mem_charge(sk, size); + + if (head != skb) { + head->len += size; + head->data_len += size; + head->truesize += size; + } + + if (eor) { + bool not_busy = skb_queue_empty(&sk->sk_write_queue); + + /* Message complete, queue it on send buffer */ + __skb_queue_tail(&sk->sk_write_queue, head); + kcm->seq_skb = NULL; + KCM_STATS_INCR(kcm->stats.tx_msgs); + + if (flags & MSG_BATCH) { + kcm->tx_wait_more = true; + } else if (kcm->tx_wait_more || not_busy) { + err = kcm_write_msgs(kcm); + if (err < 0) { + /* We got a hard error in write_msgs but have + * already queued this message. 
Report an error + * in the socket, but don't affect return value + * from sendmsg + */ + pr_warn("KCM: Hard failure on kcm_write_msgs\n"); + report_csk_error(&kcm->sk, -err); + } + } + } else { + /* Message not complete, save state */ + kcm->seq_skb = head; + kcm_tx_msg(head)->last_skb = skb; + } + + KCM_STATS_ADD(kcm->stats.tx_bytes, size); + + release_sock(sk); + return size; + +out_error: + kcm_push(kcm); + + err = sk_stream_error(sk, flags, err); + + /* make sure we wake any epoll edge trigger waiter */ + if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 && err == -EAGAIN)) + sk->sk_write_space(sk); + + release_sock(sk); + return err; +} + +static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len) +{ + struct sock *sk = sock->sk; + struct kcm_sock *kcm = kcm_sk(sk); + struct sk_buff *skb = NULL, *head = NULL; + size_t copy, copied = 0; + long timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT); + int eor = (sock->type == SOCK_DGRAM) ? + !(msg->msg_flags & MSG_MORE) : !!(msg->msg_flags & MSG_EOR); + int err = -EPIPE; + + lock_sock(sk); + + /* Per tcp_sendmsg this should be in poll */ + sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk); + + if (sk->sk_err) + goto out_error; + + if (kcm->seq_skb) { + /* Previously opened message */ + head = kcm->seq_skb; + skb = kcm_tx_msg(head)->last_skb; + goto start; + } + + /* Call the sk_stream functions to manage the sndbuf mem. */ + if (!sk_stream_memory_free(sk)) { + kcm_push(kcm); + set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); + err = sk_stream_wait_memory(sk, &timeo); + if (err) + goto out_error; + } + + /* New message, alloc head skb */ + head = alloc_skb(0, sk->sk_allocation); + while (!head) { + kcm_push(kcm); + err = sk_stream_wait_memory(sk, &timeo); + if (err) + goto out_error; + + head = alloc_skb(0, sk->sk_allocation); + } + + skb = head; + + /* Set ip_summed to CHECKSUM_UNNECESSARY to avoid calling + * csum_and_copy_from_iter from skb_do_copy_data_nocache. 
+ */ + skb->ip_summed = CHECKSUM_UNNECESSARY; + +start: + while (msg_data_left(msg)) { + bool merge = true; + int i = skb_shinfo(skb)->nr_frags; + struct page_frag *pfrag = sk_page_frag(sk); + + if (!sk_page_frag_refill(sk, pfrag)) + goto wait_for_memory; + + if (!skb_can_coalesce(skb, i, pfrag->page, + pfrag->offset)) { + if (i == MAX_SKB_FRAGS) { + struct sk_buff *tskb; + + tskb = alloc_skb(0, sk->sk_allocation); + if (!tskb) + goto wait_for_memory; + + if (head == skb) + skb_shinfo(head)->frag_list = tskb; + else + skb->next = tskb; + + skb = tskb; + skb->ip_summed = CHECKSUM_UNNECESSARY; + continue; + } + merge = false; + } + + copy = min_t(int, msg_data_left(msg), + pfrag->size - pfrag->offset); + + if (!sk_wmem_schedule(sk, copy)) + goto wait_for_memory; + + err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb, + pfrag->page, + pfrag->offset, + copy); + if (err) + goto out_error; + + /* Update the skb. */ + if (merge) { + skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy); + } else { + skb_fill_page_desc(skb, i, pfrag->page, + pfrag->offset, copy); + get_page(pfrag->page); + } + + pfrag->offset += copy; + copied += copy; + if (head != skb) { + head->len += copy; + head->data_len += copy; + } + + continue; + +wait_for_memory: + kcm_push(kcm); + err = sk_stream_wait_memory(sk, &timeo); + if (err) + goto out_error; + } + + if (eor) { + bool not_busy = skb_queue_empty(&sk->sk_write_queue); + + /* Message complete, queue it on send buffer */ + __skb_queue_tail(&sk->sk_write_queue, head); + kcm->seq_skb = NULL; + KCM_STATS_INCR(kcm->stats.tx_msgs); + + if (msg->msg_flags & MSG_BATCH) { + kcm->tx_wait_more = true; + } else if (kcm->tx_wait_more || not_busy) { + err = kcm_write_msgs(kcm); + if (err < 0) { + /* We got a hard error in write_msgs but have + * already queued this message. 
Report an error + * in the socket, but don't affect return value + * from sendmsg + */ + pr_warn("KCM: Hard failure on kcm_write_msgs\n"); + report_csk_error(&kcm->sk, -err); + } + } + } else { + /* Message not complete, save state */ +partial_message: + kcm->seq_skb = head; + kcm_tx_msg(head)->last_skb = skb; + } + + KCM_STATS_ADD(kcm->stats.tx_bytes, copied); + + release_sock(sk); + return copied; + +out_error: + kcm_push(kcm); + + if (copied && sock->type == SOCK_SEQPACKET) { + /* Wrote some bytes before encountering an + * error, return partial success. + */ + goto partial_message; + } + + if (head != kcm->seq_skb) + kfree_skb(head); + + err = sk_stream_error(sk, msg->msg_flags, err); + + /* make sure we wake any epoll edge trigger waiter */ + if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 && err == -EAGAIN)) + sk->sk_write_space(sk); + + release_sock(sk); + return err; +} + +static struct sk_buff *kcm_wait_data(struct sock *sk, int flags, + long timeo, int *err) +{ + struct sk_buff *skb; + + while (!(skb = skb_peek(&sk->sk_receive_queue))) { + if (sk->sk_err) { + *err = sock_error(sk); + return NULL; + } + + if (sock_flag(sk, SOCK_DONE)) + return NULL; + + if ((flags & MSG_DONTWAIT) || !timeo) { + *err = -EAGAIN; + return NULL; + } + + sk_wait_data(sk, &timeo, NULL); + + /* Handle signals */ + if (signal_pending(current)) { + *err = sock_intr_errno(timeo); + return NULL; + } + } + + return skb; +} + +static int kcm_recvmsg(struct socket *sock, struct msghdr *msg, + size_t len, int flags) +{ + struct sock *sk = sock->sk; + struct kcm_sock *kcm = kcm_sk(sk); + int err = 0; + long timeo; + struct kcm_rx_msg *rxm; + int copied = 0; + struct sk_buff *skb; + + timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); + + lock_sock(sk); + + skb = kcm_wait_data(sk, flags, timeo, &err); + if (!skb) + goto out; + + /* Okay, have a message on the receive queue */ + + rxm = kcm_rx_msg(skb); + + if (len > rxm->full_len) + len = rxm->full_len; + + err = 
skb_copy_datagram_msg(skb, rxm->offset, msg, len); + if (err < 0) + goto out; + + copied = len; + if (likely(!(flags & MSG_PEEK))) { + KCM_STATS_ADD(kcm->stats.rx_bytes, copied); + if (copied < rxm->full_len) { + if (sock->type == SOCK_DGRAM) { + /* Truncated message */ + msg->msg_flags |= MSG_TRUNC; + goto msg_finished; + } + rxm->offset += copied; + rxm->full_len -= copied; + } else { +msg_finished: + /* Finished with message */ + msg->msg_flags |= MSG_EOR; + KCM_STATS_INCR(kcm->stats.rx_msgs); + skb_unlink(skb, &sk->sk_receive_queue); + kfree_skb(skb); + } + } + +out: + release_sock(sk); + + return copied ? : err; +} + +static ssize_t kcm_sock_splice(struct sock *sk, + struct pipe_inode_info *pipe, + struct splice_pipe_desc *spd) +{ + int ret; + + release_sock(sk); + ret = splice_to_pipe(pipe, spd); + lock_sock(sk); + + return ret; +} + +static ssize_t kcm_splice_read(struct socket *sock, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, + unsigned int flags) +{ + struct sock *sk = sock->sk; + struct kcm_sock *kcm = kcm_sk(sk); + long timeo; + struct kcm_rx_msg *rxm; + int err = 0; + size_t copied; + struct sk_buff *skb; + + /* Only support splice for SOCKSEQPACKET */ + + timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); + + lock_sock(sk); + + skb = kcm_wait_data(sk, flags, timeo, &err); + if (!skb) + goto err_out; + + /* Okay, have a message on the receive queue */ + + rxm = kcm_rx_msg(skb); + + if (len > rxm->full_len) + len = rxm->full_len; + + copied = skb_splice_bits(skb, sk, rxm->offset, pipe, len, flags, + kcm_sock_splice); + if (copied < 0) { + err = copied; + goto err_out; + } + + KCM_STATS_ADD(kcm->stats.rx_bytes, copied); + + rxm->offset += copied; + rxm->full_len -= copied; + + /* We have no way to return MSG_EOR. If all the bytes have been + * read we still leave the message in the receive socket buffer. + * A subsequent recvmsg needs to be done to return MSG_EOR and + * finish reading the message. 
+ */ + + release_sock(sk); + + return copied; + +err_out: + release_sock(sk); + + return err; +} + +/* kcm sock lock held */ +static void kcm_recv_disable(struct kcm_sock *kcm) +{ + struct kcm_mux *mux = kcm->mux; + + if (kcm->rx_disabled) + return; + + spin_lock_bh(&mux->rx_lock); + + kcm->rx_disabled = 1; + + /* If a psock is reserved we'll do cleanup in unreserve */ + if (!kcm->rx_psock) { + if (kcm->rx_wait) { + list_del(&kcm->wait_rx_list); + kcm->rx_wait = false; + } + + requeue_rx_msgs(mux, &kcm->sk.sk_receive_queue); + } + + spin_unlock_bh(&mux->rx_lock); +} + +/* kcm sock lock held */ +static void kcm_recv_enable(struct kcm_sock *kcm) +{ + struct kcm_mux *mux = kcm->mux; + + if (!kcm->rx_disabled) + return; + + spin_lock_bh(&mux->rx_lock); + + kcm->rx_disabled = 0; + kcm_rcv_ready(kcm); + + spin_unlock_bh(&mux->rx_lock); +} + +static int kcm_setsockopt(struct socket *sock, int level, int optname, + char __user *optval, unsigned int optlen) +{ + struct kcm_sock *kcm = kcm_sk(sock->sk); + int val, valbool; + int err = 0; + + if (level != SOL_KCM) + return -ENOPROTOOPT; + + if (optlen < sizeof(int)) + return -EINVAL; + + if (get_user(val, (int __user *)optval)) + return -EINVAL; + + valbool = val ? 
1 : 0; + + switch (optname) { + case KCM_RECV_DISABLE: + lock_sock(&kcm->sk); + if (valbool) + kcm_recv_disable(kcm); + else + kcm_recv_enable(kcm); + release_sock(&kcm->sk); + break; + default: + err = -ENOPROTOOPT; + } + + return err; +} + +static int kcm_getsockopt(struct socket *sock, int level, int optname, + char __user *optval, int __user *optlen) +{ + struct kcm_sock *kcm = kcm_sk(sock->sk); + int val, len; + + if (level != SOL_KCM) + return -ENOPROTOOPT; + + if (get_user(len, optlen)) + return -EFAULT; + + len = min_t(unsigned int, len, sizeof(int)); + if (len < 0) + return -EINVAL; + + switch (optname) { + case KCM_RECV_DISABLE: + val = kcm->rx_disabled; + break; + default: + return -ENOPROTOOPT; + } + + if (put_user(len, optlen)) + return -EFAULT; + if (copy_to_user(optval, &val, len)) + return -EFAULT; + return 0; +} + +static void init_kcm_sock(struct kcm_sock *kcm, struct kcm_mux *mux) +{ + struct kcm_sock *tkcm; + struct list_head *head; + int index = 0; + + /* For SOCK_SEQPACKET sock type, datagram_poll checks the sk_state, so + * we set sk_state, otherwise epoll_wait always returns right away with + * POLLHUP + */ + kcm->sk.sk_state = TCP_ESTABLISHED; + + /* Add to mux's kcm sockets list */ + kcm->mux = mux; + spin_lock_bh(&mux->lock); + + head = &mux->kcm_socks; + list_for_each_entry(tkcm, &mux->kcm_socks, kcm_sock_list) { + if (tkcm->index != index) + break; + head = &tkcm->kcm_sock_list; + index++; + } + + list_add(&kcm->kcm_sock_list, head); + kcm->index = index; + + mux->kcm_socks_cnt++; + spin_unlock_bh(&mux->lock); + + INIT_WORK(&kcm->tx_work, kcm_tx_work); + + spin_lock_bh(&mux->rx_lock); + kcm_rcv_ready(kcm); + spin_unlock_bh(&mux->rx_lock); +} + +static void kcm_rx_msg_timeout(unsigned long arg) +{ + struct kcm_psock *psock = (struct kcm_psock *)arg; + + /* Message assembly timed out */ + KCM_STATS_INCR(psock->stats.rx_msg_timeouts); + kcm_abort_rx_psock(psock, ETIMEDOUT, NULL); +} + +static int kcm_attach(struct socket *sock, struct 
socket *csock, + struct bpf_prog *prog) +{ + struct kcm_sock *kcm = kcm_sk(sock->sk); + struct kcm_mux *mux = kcm->mux; + struct sock *csk; + struct kcm_psock *psock = NULL, *tpsock; + struct list_head *head; + int index = 0; + + if (csock->ops->family != PF_INET && + csock->ops->family != PF_INET6) + return -EINVAL; + + csk = csock->sk; + if (!csk) + return -EINVAL; + + /* Only support TCP for now */ + if (csk->sk_protocol != IPPROTO_TCP) + return -EINVAL; + + psock = kmem_cache_zalloc(kcm_psockp, GFP_KERNEL); + if (!psock) + return -ENOMEM; + + psock->mux = mux; + psock->sk = csk; + psock->bpf_prog = prog; + + setup_timer(&psock->rx_msg_timer, kcm_rx_msg_timeout, + (unsigned long)psock); + + INIT_WORK(&psock->rx_work, psock_rx_work); + INIT_DELAYED_WORK(&psock->rx_delayed_work, psock_rx_delayed_work); + + sock_hold(csk); + + write_lock_bh(&csk->sk_callback_lock); + psock->save_data_ready = csk->sk_data_ready; + psock->save_write_space = csk->sk_write_space; + psock->save_state_change = csk->sk_state_change; + csk->sk_user_data = psock; + csk->sk_data_ready = psock_tcp_data_ready; + csk->sk_write_space = psock_tcp_write_space; + csk->sk_state_change = psock_tcp_state_change; + write_unlock_bh(&csk->sk_callback_lock); + + /* Finished initialization, now add the psock to the MUX. 
*/ + spin_lock_bh(&mux->lock); + head = &mux->psocks; + list_for_each_entry(tpsock, &mux->psocks, psock_list) { + if (tpsock->index != index) + break; + head = &tpsock->psock_list; + index++; + } + + list_add(&psock->psock_list, head); + psock->index = index; + + KCM_STATS_INCR(mux->stats.psock_attach); + mux->psocks_cnt++; + psock_now_avail(psock); + spin_unlock_bh(&mux->lock); + + /* Schedule RX work in case there are already bytes queued */ + queue_work(kcm_wq, &psock->rx_work); + + return 0; +} + +static int kcm_attach_ioctl(struct socket *sock, struct kcm_attach *info) +{ + struct socket *csock; + struct bpf_prog *prog; + int err; + + csock = sockfd_lookup(info->fd, &err); + if (!csock) + return -ENOENT; + + prog = bpf_prog_get(info->bpf_fd); + if (IS_ERR(prog)) { + err = PTR_ERR(prog); + goto out; + } + + if (prog->type != BPF_PROG_TYPE_SOCKET_FILTER) { + bpf_prog_put(prog); + err = -EINVAL; + goto out; + } + + err = kcm_attach(sock, csock, prog); + if (err) { + bpf_prog_put(prog); + goto out; + } + + /* Keep reference on file also */ + + return 0; +out: + fput(csock->file); + return err; +} + +static void kcm_unattach(struct kcm_psock *psock) +{ + struct sock *csk = psock->sk; + struct kcm_mux *mux = psock->mux; + + /* Stop getting callbacks from TCP socket. After this there should + * be no way to reserve a kcm for this psock. + */ + write_lock_bh(&csk->sk_callback_lock); + csk->sk_user_data = NULL; + csk->sk_data_ready = psock->save_data_ready; + csk->sk_write_space = psock->save_write_space; + csk->sk_state_change = psock->save_state_change; + psock->rx_stopped = 1; + + if (WARN_ON(psock->rx_kcm)) { + write_unlock_bh(&csk->sk_callback_lock); + return; + } + + spin_lock_bh(&mux->rx_lock); + + /* Stop receiver activities. After this point psock should not be + * able to get onto ready list either through callbacks or work. 
+ */ + if (psock->ready_rx_msg) { + list_del(&psock->psock_ready_list); + kfree_skb(psock->ready_rx_msg); + psock->ready_rx_msg = NULL; + KCM_STATS_INCR(mux->stats.rx_ready_drops); + } + + spin_unlock_bh(&mux->rx_lock); + + write_unlock_bh(&csk->sk_callback_lock); + + del_timer_sync(&psock->rx_msg_timer); + cancel_work_sync(&psock->rx_work); + cancel_delayed_work_sync(&psock->rx_delayed_work); + + bpf_prog_put(psock->bpf_prog); + + kfree_skb(psock->rx_skb_head); + psock->rx_skb_head = NULL; + + spin_lock_bh(&mux->lock); + + aggregate_psock_stats(&psock->stats, &mux->aggregate_psock_stats); + + KCM_STATS_INCR(mux->stats.psock_unattach); + + if (psock->tx_kcm) { + /* psock was reserved. Just mark it finished and we will clean + * up in the kcm paths, we need kcm lock which can not be + * acquired here. + */ + KCM_STATS_INCR(mux->stats.psock_unattach_rsvd); + spin_unlock_bh(&mux->lock); + + /* We are unattaching a socket that is reserved. Abort the + * socket since we may be out of sync in sending on it. We need + * to do this without the mux lock. 
+ */ + kcm_abort_tx_psock(psock, EPIPE, false); + + spin_lock_bh(&mux->lock); + if (!psock->tx_kcm) { + /* psock now unreserved in window mux was unlocked */ + goto no_reserved; + } + psock->done = 1; + + /* Commit done before queuing work to process it */ + smp_mb(); + + /* Queue tx work to make sure psock->done is handled */ + queue_work(kcm_wq, &psock->tx_kcm->tx_work); + spin_unlock_bh(&mux->lock); + } else { +no_reserved: + if (!psock->tx_stopped) + list_del(&psock->psock_avail_list); + list_del(&psock->psock_list); + mux->psocks_cnt--; + spin_unlock_bh(&mux->lock); + + sock_put(csk); + fput(csk->sk_socket->file); + kmem_cache_free(kcm_psockp, psock); + } +} + +static int kcm_unattach_ioctl(struct socket *sock, struct kcm_unattach *info) +{ + struct kcm_sock *kcm = kcm_sk(sock->sk); + struct kcm_mux *mux = kcm->mux; + struct kcm_psock *psock; + struct socket *csock; + struct sock *csk; + int err; + + csock = sockfd_lookup(info->fd, &err); + if (!csock) + return -ENOENT; + + csk = csock->sk; + if (!csk) { + err = -EINVAL; + goto out; + } + + err = -ENOENT; + + spin_lock_bh(&mux->lock); + + list_for_each_entry(psock, &mux->psocks, psock_list) { + if (psock->sk != csk) + continue; + + /* Found the matching psock */ + + if (psock->unattaching || WARN_ON(psock->done)) { + err = -EALREADY; + break; + } + + psock->unattaching = 1; + + spin_unlock_bh(&mux->lock); + + kcm_unattach(psock); + + err = 0; + goto out; + } + + spin_unlock_bh(&mux->lock); + +out: + fput(csock->file); + return err; +} + +static struct proto kcm_proto = { + .name = "KCM", + .owner = THIS_MODULE, + .obj_size = sizeof(struct kcm_sock), +}; + +/* Clone a kcm socket. 
*/ +static int kcm_clone(struct socket *osock, struct kcm_clone *info, + struct socket **newsockp) +{ + struct socket *newsock; + struct sock *newsk; + struct file *newfile; + int err, newfd; + + err = -ENFILE; + newsock = sock_alloc(); + if (!newsock) + goto out; + + newsock->type = osock->type; + newsock->ops = osock->ops; + + __module_get(newsock->ops->owner); + + newfd = get_unused_fd_flags(0); + if (unlikely(newfd < 0)) { + err = newfd; + goto out_fd_fail; + } + + newfile = sock_alloc_file(newsock, 0, osock->sk->sk_prot_creator->name); + if (unlikely(IS_ERR(newfile))) { + err = PTR_ERR(newfile); + goto out_sock_alloc_fail; + } + + newsk = sk_alloc(sock_net(osock->sk), PF_KCM, GFP_KERNEL, + &kcm_proto, true); + if (!newsk) { + err = -ENOMEM; + goto out_sk_alloc_fail; + } + + sock_init_data(newsock, newsk); + init_kcm_sock(kcm_sk(newsk), kcm_sk(osock->sk)->mux); + + fd_install(newfd, newfile); + *newsockp = newsock; + info->fd = newfd; + + return 0; + +out_sk_alloc_fail: + fput(newfile); +out_sock_alloc_fail: + put_unused_fd(newfd); +out_fd_fail: + sock_release(newsock); +out: + return err; +} + +static int kcm_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) +{ + int err; + + switch (cmd) { + case SIOCKCMATTACH: { + struct kcm_attach info; + + if (copy_from_user(&info, (void __user *)arg, sizeof(info))) + err = -EFAULT; + + err = kcm_attach_ioctl(sock, &info); + + break; + } + case SIOCKCMUNATTACH: { + struct kcm_unattach info; + + if (copy_from_user(&info, (void __user *)arg, sizeof(info))) + err = -EFAULT; + + err = kcm_unattach_ioctl(sock, &info); + + break; + } + case SIOCKCMCLONE: { + struct kcm_clone info; + struct socket *newsock = NULL; + + if (copy_from_user(&info, (void __user *)arg, sizeof(info))) + err = -EFAULT; + + err = kcm_clone(sock, &info, &newsock); + + if (!err) { + if (copy_to_user((void __user *)arg, &info, + sizeof(info))) { + err = -EFAULT; + sock_release(newsock); + } + } + + break; + } + default: + err = -ENOIOCTLCMD; + 
break; + } + + return err; +} + +static void free_mux(struct rcu_head *rcu) +{ + struct kcm_mux *mux = container_of(rcu, + struct kcm_mux, rcu); + + kmem_cache_free(kcm_muxp, mux); +} + +static void release_mux(struct kcm_mux *mux) +{ + struct kcm_net *knet = mux->knet; + struct kcm_psock *psock, *tmp_psock; + + /* Release psocks */ + list_for_each_entry_safe(psock, tmp_psock, + &mux->psocks, psock_list) { + if (!WARN_ON(psock->unattaching)) + kcm_unattach(psock); + } + + if (WARN_ON(mux->psocks_cnt)) + return; + + __skb_queue_purge(&mux->rx_hold_queue); + + mutex_lock(&knet->mutex); + aggregate_mux_stats(&mux->stats, &knet->aggregate_mux_stats); + aggregate_psock_stats(&mux->aggregate_psock_stats, + &knet->aggregate_psock_stats); + list_del_rcu(&mux->kcm_mux_list); + knet->count--; + mutex_unlock(&knet->mutex); + + call_rcu(&mux->rcu, free_mux); +} + +static void kcm_done(struct kcm_sock *kcm) +{ + struct kcm_mux *mux = kcm->mux; + struct sock *sk = &kcm->sk; + int socks_cnt; + + spin_lock_bh(&mux->rx_lock); + if (kcm->rx_psock) { + /* Cleanup in unreserve_rx_kcm */ + WARN_ON(kcm->done); + kcm->rx_disabled = 1; + kcm->done = 1; + spin_unlock_bh(&mux->rx_lock); + return; + } + + if (kcm->rx_wait) { + list_del(&kcm->wait_rx_list); + kcm->rx_wait = false; + } + /* Move any pending receive messages to other kcm sockets */ + requeue_rx_msgs(mux, &sk->sk_receive_queue); + + spin_unlock_bh(&mux->rx_lock); + + if (WARN_ON(sk_rmem_alloc_get(sk))) + return; + + /* Detach from MUX */ + spin_lock_bh(&mux->lock); + + list_del(&kcm->kcm_sock_list); + mux->kcm_socks_cnt--; + socks_cnt = mux->kcm_socks_cnt; + + spin_unlock_bh(&mux->lock); + + if (!socks_cnt) { + /* We are done with the mux now. */ + release_mux(mux); + } + + WARN_ON(kcm->rx_wait); + + sock_put(&kcm->sk); +} + +/* Called by kcm_release to close a KCM socket. + * If this is the last KCM socket on the MUX, destroy the MUX. 
+ */ +static int kcm_release(struct socket *sock) +{ + struct sock *sk = sock->sk; + struct kcm_sock *kcm; + struct kcm_mux *mux; + struct kcm_psock *psock; + + if (!sk) + return 0; + + kcm = kcm_sk(sk); + mux = kcm->mux; + + sock_orphan(sk); + kfree_skb(kcm->seq_skb); + + lock_sock(sk); + /* Purge queue under lock to avoid race condition with tx_work trying + * to act when queue is nonempty. If tx_work runs after this point + * it will just return. + */ + __skb_queue_purge(&sk->sk_write_queue); + release_sock(sk); + + spin_lock_bh(&mux->lock); + if (kcm->tx_wait) { + /* Take off tx_wait list, after this point there should be no way + * that a psock will be assigned to this kcm. + */ + list_del(&kcm->wait_psock_list); + kcm->tx_wait = false; + } + spin_unlock_bh(&mux->lock); + + /* Cancel work. After this point there should be no outside references + * to the kcm socket. + */ + cancel_work_sync(&kcm->tx_work); + + lock_sock(sk); + psock = kcm->tx_psock; + if (psock) { + /* A psock was reserved, so we need to kill it since it + * may already have some bytes queued from a message. We + * need to do this after removing kcm from tx_wait list. 
+		 */
+		kcm_abort_tx_psock(psock, EPIPE, false);
+		unreserve_psock(kcm);
+	}
+	release_sock(sk);
+
+	WARN_ON(kcm->tx_wait);
+	WARN_ON(kcm->tx_psock);
+
+	sock->sk = NULL;
+
+	kcm_done(kcm);
+
+	return 0;
+}
+
+static const struct proto_ops kcm_dgram_ops = {
+	.family = PF_KCM,
+	.owner = THIS_MODULE,
+	.release = kcm_release,
+	.bind = sock_no_bind,
+	.connect = sock_no_connect,
+	.socketpair = sock_no_socketpair,
+	.accept = sock_no_accept,
+	.getname = sock_no_getname,
+	.poll = datagram_poll,
+	.ioctl = kcm_ioctl,
+	.listen = sock_no_listen,
+	.shutdown = sock_no_shutdown,
+	.setsockopt = kcm_setsockopt,
+	.getsockopt = kcm_getsockopt,
+	.sendmsg = kcm_sendmsg,
+	.recvmsg = kcm_recvmsg,
+	.mmap = sock_no_mmap,
+	.sendpage = kcm_sendpage,
+};
+
+static const struct proto_ops kcm_seqpacket_ops = {
+	.family = PF_KCM,
+	.owner = THIS_MODULE,
+	.release = kcm_release,
+	.bind = sock_no_bind,
+	.connect = sock_no_connect,
+	.socketpair = sock_no_socketpair,
+	.accept = sock_no_accept,
+	.getname = sock_no_getname,
+	.poll = datagram_poll,
+	.ioctl = kcm_ioctl,
+	.listen = sock_no_listen,
+	.shutdown = sock_no_shutdown,
+	.setsockopt = kcm_setsockopt,
+	.getsockopt = kcm_getsockopt,
+	.sendmsg = kcm_sendmsg,
+	.recvmsg = kcm_recvmsg,
+	.mmap = sock_no_mmap,
+	.sendpage = kcm_sendpage,
+	.splice_read = kcm_splice_read,
+};
+
+/* Create proto operation for kcm sockets */
+static int kcm_create(struct net *net, struct socket *sock,
+		      int protocol, int kern)
+{
+	struct kcm_net *knet = net_generic(net, kcm_net_id);
+	struct sock *sk;
+	struct kcm_mux *mux;
+
+	switch (sock->type) {
+	case SOCK_DGRAM:
+		sock->ops = &kcm_dgram_ops;
+		break;
+	case SOCK_SEQPACKET:
+		sock->ops = &kcm_seqpacket_ops;
+		break;
+	default:
+		return -ESOCKTNOSUPPORT;
+	}
+
+	if (protocol != KCMPROTO_CONNECTED)
+		return -EPROTONOSUPPORT;
+
+	sk = sk_alloc(net, PF_KCM, GFP_KERNEL, &kcm_proto, kern);
+	if (!sk)
+		return -ENOMEM;
+
+	/* Allocate a kcm mux, shared between KCM sockets */
+	mux = kmem_cache_zalloc(kcm_muxp, GFP_KERNEL);
+	if (!mux) {
+		sk_free(sk);
+		return -ENOMEM;
+	}
+
+	spin_lock_init(&mux->lock);
+	spin_lock_init(&mux->rx_lock);
+	INIT_LIST_HEAD(&mux->kcm_socks);
+	INIT_LIST_HEAD(&mux->kcm_rx_waiters);
+	INIT_LIST_HEAD(&mux->kcm_tx_waiters);
+
+	INIT_LIST_HEAD(&mux->psocks);
+	INIT_LIST_HEAD(&mux->psocks_ready);
+	INIT_LIST_HEAD(&mux->psocks_avail);
+
+	mux->knet = knet;
+
+	/* Add new MUX to list */
+	mutex_lock(&knet->mutex);
+	list_add_rcu(&mux->kcm_mux_list, &knet->mux_list);
+	knet->count++;
+	mutex_unlock(&knet->mutex);
+
+	skb_queue_head_init(&mux->rx_hold_queue);
+
+	/* Init KCM socket */
+	sock_init_data(sock, sk);
+	init_kcm_sock(kcm_sk(sk), mux);
+
+	return 0;
+}
+
+static struct net_proto_family kcm_family_ops = {
+	.family = PF_KCM,
+	.create = kcm_create,
+	.owner = THIS_MODULE,
+};
+
+static __net_init int kcm_init_net(struct net *net)
+{
+	struct kcm_net *knet = net_generic(net, kcm_net_id);
+
+	INIT_LIST_HEAD_RCU(&knet->mux_list);
+	mutex_init(&knet->mutex);
+
+	return 0;
+}
+
+static __net_exit void kcm_exit_net(struct net *net)
+{
+	struct kcm_net *knet = net_generic(net, kcm_net_id);
+
+	/* All KCM sockets should be closed at this point, which should mean
+	 * that all multiplexors and psocks have been destroyed.
+	 */
+	WARN_ON(!list_empty(&knet->mux_list));
+}
+
+static struct pernet_operations kcm_net_ops = {
+	.init = kcm_init_net,
+	.exit = kcm_exit_net,
+	.id = &kcm_net_id,
+	.size = sizeof(struct kcm_net),
+};
+
+static int __init kcm_init(void)
+{
+	int err = -ENOMEM;
+
+	kcm_muxp = kmem_cache_create("kcm_mux_cache",
+				     sizeof(struct kcm_mux), 0,
+				     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+	if (!kcm_muxp)
+		goto fail;
+
+	kcm_psockp = kmem_cache_create("kcm_psock_cache",
+				       sizeof(struct kcm_psock), 0,
+				       SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+	if (!kcm_psockp)
+		goto fail;
+
+	kcm_wq = create_singlethread_workqueue("kkcmd");
+	if (!kcm_wq)
+		goto fail;
+
+	err = proto_register(&kcm_proto, 1);
+	if (err)
+		goto fail;
+
+	err = sock_register(&kcm_family_ops);
+	if (err)
+		goto sock_register_fail;
+
+	err = register_pernet_device(&kcm_net_ops);
+	if (err)
+		goto net_ops_fail;
+
+	err = kcm_proc_init();
+	if (err)
+		goto proc_init_fail;
+
+	return 0;
+
+proc_init_fail:
+	unregister_pernet_device(&kcm_net_ops);
+
+net_ops_fail:
+	sock_unregister(PF_KCM);
+
+sock_register_fail:
+	proto_unregister(&kcm_proto);
+
+fail:
+	kmem_cache_destroy(kcm_muxp);
+	kmem_cache_destroy(kcm_psockp);
+
+	if (kcm_wq)
+		destroy_workqueue(kcm_wq);
+
+	return err;
+}
+
+static void __exit kcm_exit(void)
+{
+	kcm_proc_exit();
+	unregister_pernet_device(&kcm_net_ops);
+	sock_unregister(PF_KCM);
+	proto_unregister(&kcm_proto);
+	destroy_workqueue(kcm_wq);
+
+	kmem_cache_destroy(kcm_muxp);
+	kmem_cache_destroy(kcm_psockp);
+}
+
+module_init(kcm_init);
+module_exit(kcm_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_NETPROTO(PF_KCM);
+
diff --git a/net/socket.c b/net/socket.c
index c044d1e8508c..886649c88d8f 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -533,7 +533,7 @@ static const struct inode_operations sockfs_inode_ops = {
  *	NULL is returned.
  */
 
-static struct socket *sock_alloc(void)
+struct socket *sock_alloc(void)
 {
 	struct inode *inode;
 	struct socket *sock;
@@ -554,6 +554,7 @@ static struct socket *sock_alloc(void)
 	this_cpu_add(sockets_in_use, 1);
 	return sock;
 }
+EXPORT_SYMBOL(sock_alloc);
 
 /**
  *	sock_release	-	close a socket
@@ -1874,7 +1875,8 @@ static int copy_msghdr_from_user(struct msghdr *kmsg,
 
 static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg,
 			 struct msghdr *msg_sys, unsigned int flags,
-			 struct used_address *used_address)
+			 struct used_address *used_address,
+			 unsigned int allowed_msghdr_flags)
 {
 	struct compat_msghdr __user *msg_compat =
 	    (struct compat_msghdr __user *)msg;
@@ -1900,6 +1902,7 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg,
 
 	if (msg_sys->msg_controllen > INT_MAX)
 		goto out_freeiov;
+	flags |= (msg_sys->msg_flags & allowed_msghdr_flags);
 	ctl_len = msg_sys->msg_controllen;
 	if ((MSG_CMSG_COMPAT & flags) && ctl_len) {
 		err =
@@ -1978,7 +1981,7 @@ long __sys_sendmsg(int fd, struct user_msghdr __user *msg, unsigned flags)
 	if (!sock)
 		goto out;
 
-	err = ___sys_sendmsg(sock, msg, &msg_sys, flags, NULL);
+	err = ___sys_sendmsg(sock, msg, &msg_sys, flags, NULL, 0);
 
 	fput_light(sock->file, fput_needed);
 out:
@@ -2005,6 +2008,7 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 	struct compat_mmsghdr __user *compat_entry;
 	struct msghdr msg_sys;
 	struct used_address used_address;
+	unsigned int oflags = flags;
 
 	if (vlen > UIO_MAXIOV)
 		vlen = UIO_MAXIOV;
@@ -2019,11 +2023,15 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 	entry = mmsg;
 	compat_entry = (struct compat_mmsghdr __user *)mmsg;
 	err = 0;
+	flags |= MSG_BATCH;
 
 	while (datagrams < vlen) {
+		if (datagrams == vlen - 1)
+			flags = oflags;
+
 		if (MSG_CMSG_COMPAT & flags) {
 			err = ___sys_sendmsg(sock, (struct user_msghdr __user *)compat_entry,
-					     &msg_sys, flags, &used_address);
+					     &msg_sys, flags, &used_address, MSG_EOR);
 			if (err < 0)
 				break;
 			err = __put_user(err, &compat_entry->msg_len);
@@ -2031,7 +2039,7 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
 		} else {
 			err = ___sys_sendmsg(sock,
 					     (struct user_msghdr __user *)entry,
-					     &msg_sys, flags, &used_address);
+					     &msg_sys, flags, &used_address, MSG_EOR);
 			if (err < 0)
 				break;
 			err = put_user(err, &entry->msg_len);
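
Note on the net/socket.c hunks above: the flag handling added to __sys_sendmmsg() and ___sys_sendmsg() can be modelled in userspace as a small pure function. The sketch below is only an illustration and is not part of this commit. The helper name sendmmsg_effective_flags and the SKETCH_ constants are hypothetical; the value 0x40000 for MSG_BATCH is an assumption about include/linux/socket.h, and the model ignores the early exit taken when a send fails.

```c
#include <stddef.h>

/* Local copies of the flag values, named SKETCH_* to avoid clashing with
 * system headers. MSG_EOR is 0x80 on Linux; 0x40000 for MSG_BATCH is an
 * assumption about include/linux/socket.h.
 */
#define SKETCH_MSG_EOR		0x80u
#define SKETCH_MSG_BATCH	0x40000u

/* Hypothetical helper, not part of the patch: compute the flags that the
 * patched code would use for message number `index` (0-based) out of `vlen`
 * messages in one sendmmsg() call.
 *
 * oflags    - flags argument passed to sendmmsg()
 * msg_flags - msg_hdr.msg_flags of this particular message
 */
unsigned int sendmmsg_effective_flags(unsigned int oflags,
				      unsigned int msg_flags,
				      size_t index, size_t vlen)
{
	/* __sys_sendmmsg(): MSG_BATCH is set before the loop ... */
	unsigned int flags = oflags | SKETCH_MSG_BATCH;

	/* ... and dropped again for the last message of the batch. */
	if (index == vlen - 1)
		flags = oflags;

	/* ___sys_sendmsg(): only the bits named in allowed_msghdr_flags
	 * (MSG_EOR for the sendmmsg() path) are taken from msg_flags.
	 */
	flags |= msg_flags & SKETCH_MSG_EOR;

	return flags;
}
```

Under these assumptions, every message except the last is sent with MSG_BATCH set, and a caller can mark an individual message with MSG_EOR through its msg_hdr.msg_flags. The plain sendmsg() path passes 0 as allowed_msghdr_flags, so there msg_flags contributes nothing.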