The obvious way to implement TCP, and reliable transfer protocols in general, involves copying data into and out of socket buffers. This copying costs CPU time and memory, and may reduce the throughput of a system that transfers large amounts of data.
Zerocopy TCP attempts to perform data transfer using as few copies as possible. In the best case, data is transferred without any intermediate copying.
It works by making use of paged virtual memory.
On a write (send application data through a socket to a remote receiver):
One issue with this scheme is that the segment headers must be stored separately: they cannot be stored in the application pages that contain the payload data for the segments. (In general, the operating system should not make any changes to application memory unless explicitly requested to do so. It would be very surprising if calling write caused some in-memory data to be modified.) So, the headers must be stored in memory that is private to the socket. When the network device driver is ready to send a packet, it reads the headers from one buffer and the data from another. This is known as scatter-gather I/O and is commonly supported by network interface hardware devices.
Assuming the network device can do DMA (Direct Memory Access), the host CPU does not need to be involved in transferring the packet data from main memory to the network interface hardware.
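The gather half of scatter-gather I/O can be illustrated from user space. The sketch below is only an analogy: it passes a header buffer and a payload buffer to the kernel in a single call, so the two are never concatenated in application memory. It is not zerocopy TCP itself (the kernel may still copy the data), and the names `send_segment`, `header`, and `payload` are hypothetical. It assumes a Unix-like system where `socket.sendmsg` is available.

```python
import socket

def send_segment(sock, header: bytes, payload: bytes) -> int:
    """Gather write: pass two separate buffers to the kernel in one call.

    The buffers are transmitted as if they had been concatenated, but
    the application never builds a combined copy.
    """
    return sock.sendmsg([header, memoryview(payload)])

# A connected pair of local sockets stands in for the sender and receiver.
a, b = socket.socketpair()
header = b"HDR:"                 # hypothetical stand-in for a segment header
payload = b"application data"    # stays in its own buffer
sent = send_segment(a, header, payload)
received = b.recv(1024)
a.close()
b.close()
```

The receiver sees one contiguous byte stream even though the sender kept the header and payload in separate buffers.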
When a packet arrives from the link layer, it is read into separate buffers for the headers and the application data. (Again, as long as DMA is supported, this can be done without involving the host CPU.) The packet is demultiplexed, and a segment containing pointers to the headers and application data is added to the receive buffer of the appropriate socket.
On reading data from a socket: map page(s) from the socket's receive buffer into the application's virtual address space. For this operation to be done transparently, the application buffer must start and end on a multiple of the hardware page size, since the operating system will map the data pages into the application's address space, and it is not possible to map less than one page of data, nor is it possible to map a page at a non-page-aligned address. Thus, for a receive (reading data from a socket) to proceed without copying, some cooperation is required from the application.
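The alignment requirement above can be written as a small check. This is a minimal sketch: the helper names `is_page_aligned` and `round_up_to_page` are hypothetical, and the page size is taken from Python's `mmap.PAGESIZE`, which is assumed to reflect the hardware page size on the host.

```python
import mmap

PAGE_SIZE = mmap.PAGESIZE  # assumed to match the hardware page size

def is_page_aligned(address: int, length: int, page_size: int = PAGE_SIZE) -> bool:
    """True if the buffer both starts and ends on a page boundary."""
    starts_on_page = address % page_size == 0
    ends_on_page = (address + length) % page_size == 0
    return length > 0 and starts_on_page and ends_on_page

def round_up_to_page(length: int, page_size: int = PAGE_SIZE) -> int:
    """Smallest multiple of page_size that is >= length."""
    return -(-length // page_size) * page_size
```

An application that wants copy-free receives would allocate its buffer with a page-aligned allocator and request `round_up_to_page(n)` bytes rather than `n`.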
The primary function of the network layer is to route network-layer datagrams to their destination. The protocols at this layer include IP (sending and receiving datagrams), routing protocols (RIP, OSPF, BGP), and control protocols (ICMP).
Routing: the general problem of ensuring that packets arrive at their destination.
Forwarding: how a single router should send a received packet on a particular output link in order to ensure that it reaches its destination.
Packet switching terminology:
In a virtual circuit network, connections are established at the network layer (unlike datagram networks, where a higher-level transport protocol is required for connections).
Connection setup in a VC network reserves resources for the connection in each router along the path to the destination.
Forwarding is done by consulting the forwarding table to find a record based on the incoming link and the virtual circuit number of the incoming packet. The record specifies both the outgoing network link and a new VC # for the packet. (The same network connection may have many VC numbers, one per link. Trying to assign a connection the same virtual circuit number on every link would make connection setup very difficult.)
A forwarding table in a VC network contains records of the form
(Incoming Link, incoming VC #, outgoing link, outgoing VC #)
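A VC forwarding table of this shape can be sketched as a dictionary keyed by (incoming link, incoming VC #). The link and VC numbers below are made up for illustration, and `forward_vc` is a hypothetical helper name.

```python
from typing import Dict, Optional, Tuple

VcKey = Tuple[int, int]     # (incoming link, incoming VC #)
VcEntry = Tuple[int, int]   # (outgoing link, outgoing VC #)

# Illustrative entries only; a real table is filled in during connection setup.
vc_table: Dict[VcKey, VcEntry] = {
    (1, 12): (3, 22),
    (2, 63): (1, 18),
    (3, 7): (2, 17),
}

def forward_vc(table: Dict[VcKey, VcEntry], in_link: int, in_vc: int) -> Optional[VcEntry]:
    """Return (outgoing link, new VC #), or None if no circuit is set up."""
    return table.get((in_link, in_vc))
```

For example, a packet arriving on link 1 with VC # 12 leaves on link 3 carrying VC # 22. Connection setup and teardown correspond to inserting and deleting entries in the table.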
In a large VC network, many connections will be set up and torn down in a small time interval. Thus, the forwarding table must be updated very frequently: potentially on the order of every microsecond. This is an impediment to the scalability of VC networks.
In a datagram network, forwarding uses a forwarding table that is constructed from routing information. The forwarding table consists of records
(Destination address prefix, outgoing link)
The router finds the record with the longest destination address prefix that matches the destination address of the incoming datagram. Using address prefixes rather than individual addresses in the forwarding table improves scalability: many destinations can be served by a single prefix.
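Longest-prefix matching can be sketched with Python's standard `ipaddress` module. The table contents and the function name `longest_prefix_match` are illustrative assumptions, and the linear scan is chosen for clarity rather than speed.

```python
import ipaddress
from typing import List, Optional, Tuple

def longest_prefix_match(table: List[Tuple[str, int]], destination: str) -> Optional[int]:
    """Return the outgoing link of the longest prefix containing destination."""
    addr = ipaddress.ip_address(destination)
    best_len = -1
    best_link = None
    for prefix, link in table:
        net = ipaddress.ip_network(prefix)
        # Keep the matching record with the most prefix bits.
        if addr in net and net.prefixlen > best_len:
            best_len = net.prefixlen
            best_link = link
    return best_link

# Illustrative records: (destination address prefix, outgoing link)
table = [
    ("10.0.0.0/8", 0),
    ("10.1.0.0/16", 1),
    ("10.1.2.0/24", 2),
    ("0.0.0.0/0", 3),   # default route matches everything
]
```

With this table, 10.1.2.5 matches all four prefixes and is forwarded on link 2, the link of the longest match (the /24).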
Because an address prefix subsumes an entire network, or many networks, forwarding tables in a datagram network do not need to be updated very frequently. (They only need to be updated when a new network is added, or an existing network is removed.) So, routing updates on the order of once every minute or several minutes will work. This improves the scalability of datagram networks.
The general task of a router is to forward received datagrams from input links to output links. When a datagram is received, the forwarding table is consulted to find the correct output link for the datagram. The datagram is then sent via the switch fabric to the appropriate output link.
Each input and output port has an associated queue. Input datagrams may wait in the input queue until the switch fabric is available. Output datagrams may wait in the output queue until the output link is available.
In general, as long as the switch fabric can switch datagrams at a rate equal to the sum of the transmission rates of the input ports, there will be no significant queueing delays or drops at the input queues.
Datagrams may need to be dropped at an output port if the output queue for that port becomes too full.
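A bounded output queue can be sketched as follows. The policy shown, discarding the arriving datagram when the queue is full (often called drop-tail), is an assumption made for illustration; the notes do not say which drop policy is used. The class name `OutputPort` is hypothetical.

```python
from collections import deque

class OutputPort:
    """Output port with a bounded FIFO queue (drop-tail policy assumed)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0

    def enqueue(self, datagram) -> bool:
        """Queue a datagram from the switch fabric; return False if dropped."""
        if len(self.queue) >= self.capacity:
            self.dropped += 1      # queue is full: discard the new arrival
            return False
        self.queue.append(datagram)
        return True

    def transmit(self):
        """Remove and return the next datagram to send, or None if idle."""
        return self.queue.popleft() if self.queue else None
```

If the fabric delivers datagrams to a port faster than the output link can transmit them, `enqueue` starts returning False once `capacity` datagrams are waiting.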
The input ports and output ports operate largely autonomously: the only global state information needed is the forwarding table. When the forwarding table is updated, copies may be distributed to each input port so that each input port may make the forwarding decision in parallel (without any contention that might arise if a shared updateable data structure is used).
The critical element in the volume of data that can be handled by the router is the switching fabric. There are many ways to implement the switching fabric:
When a received packet arrives
Options for forwarding table lookup:
A crossbar switch is a commonly used interconnection network that allows some transfers from input port to output port to proceed in parallel:
An incoming packet travels in one direction (horizontally) until it reaches a switch element (intersection) where it can change directions and move towards the desired output queue. A crossbar can move datagrams in parallel as long as they are each destined for a different output port.
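The parallelism constraint of a crossbar can be sketched as a one-time-slot scheduler: a transfer is granted only if its input port and its output port are both still free. The function name `schedule_crossbar` and the first-come-first-served order are assumptions for illustration; real schedulers use more elaborate arbitration.

```python
from typing import List, Tuple

Transfer = Tuple[int, int]   # (input port, output port)

def schedule_crossbar(requests: List[Transfer]) -> Tuple[List[Transfer], List[Transfer]]:
    """Split requests into those that can cross in this slot and those that wait."""
    busy_inputs = set()
    busy_outputs = set()
    granted: List[Transfer] = []
    deferred: List[Transfer] = []
    for inp, out in requests:
        if inp in busy_inputs or out in busy_outputs:
            deferred.append((inp, out))   # conflicts with a granted transfer
        else:
            busy_inputs.add(inp)
            busy_outputs.add(out)
            granted.append((inp, out))
    return granted, deferred
```

For the requests `[(0, 2), (1, 3), (2, 2)]`, the first two transfers proceed in parallel, while `(2, 2)` must wait because output port 2 is already in use.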