Register or Login To Download This Patent As A PDF
| United States Patent Application |
20010025315
|
| Kind Code
|
A1
|
|
Jolitz, Lynne G.
|
September 27, 2001
|
Term addressable memory of an accelerator system and method
Abstract
An improved term addressable memory of an accelerator system and method
includes a mechanism for performing predetermined plurality of pattern
matches of packets to classify them for use with stateful protocol
processing units that can resolve session data spread across multiple
data packets and process them for the ultimate destination. The invention
replaces a conventional content addressable memory with a term
addressable memory, whereby redundant terms are recorded with a single
memory entry. Two classes of terms are used to match packet addresses and
application ports, as well as a much smaller session CAM that matches the
aggregate match of all terms to a specific session.
| Inventors: |
Jolitz, Lynne G.; (Los Gatos, CA)
|
| Correspondence Address:
|
HAROLD D. MESSNER
1021 NEBRASKA ST.
VALLEJO
CA
94590
US
|
| Serial No.:
|
756667 |
| Series Code:
|
09
|
| Filed:
|
January 10, 2001 |
| Current U.S. Class: |
709/231 |
| Class at Publication: |
709/231 |
| International Class: |
G06F 015/16 |
Claims
What is claimed is:
1. A method of matching a predetermined plurality of patterns of a
stream-oriented protocol involving a network communications device, for
data packets having a formatted header containing information about the
packet, the method comprising analyzing packet traffic on the network to
identify classes of predictable protocols which characterize a majority
of such packets and implementing programmable hardware logic to process
such classes of protocols whereby individual datapackets are meaningfully
associated with stateful stream protocols so that significantly less
memory is required reducing power consumption, and increasing connection
establishment efficiency.
2. The method of claim 1 wherein said identifying comprises storing in a
memory a plurality of predetermined patterns which correspond to said
plurality of classes; analyzing the header of a packet to identify a
match with a stored pattern; simultaneously with said analyzing,
processing the header to determine whether the packet is valid;
controlling the programmable logic to process the packet in accordance
with the class corresponding to the matched pattern; and processing in
said software non-matching and invalid packets.
3. The method of claim I wherein said network protocol comprises TCP/IP.
4. In a network communications device, for data packets having a formatted
header containing information about the packet, said network
communications device comprising a header decoder, term addressable
memory, and stateful protocol processing units, the improvement
comprising means for matching a predetermined rurality of patterns of a
stream-oriented network protocols with fewer memory elements than
streams, thereby being more efficient with memory, which results in the
ability to support more streams with the same memory resource.
5. The improvement of claim 4 wherein said stream-oriented protocol
comprises TCP/IP, and wherein said processing units compromise
programmable logic controlled by a state machine, whereby TCP segments
can be identified statefully for processing into application memory,
thereby allowing higher level stream processing of session data spread
across multiple data packets.
Description
SCOPE OF THE INVENTION
[0001] In the above-identified application, there is described and claimed
a network accelerator and method for TCP/IP that includes programmable
logic for performing network protocol processing at network signaling
rates. The programmable logic is configured in a parallel pipelined a
architecture controlled by state machines and implements processing for
predictable patterns of the majority of transmissions. In more detail,
incoming packets are compared with patterns corresponding to classes of
transmissions which are stored in a content addressable memory and are
simultaneously stored in a dual port, dual bank application memory. The
patterns are used to determine sessions to which an incoming IP datagram
belongs, and data packets stored in the application memory are processed
by the programmable logic. Processing of packet headers is performed in
parallel and during memory transfer without the necessity of conventional
store and forward techniques resulting in a substantial reduction in
latency. Packets which constitute exceptions or which have checksum or
other errors are processed in software.
[0002] It has now been discovered that the above-described and claimed
accelerator and method has surprising improvement using an improved
content or term adressable memory called "VxCAM or VIRTUAL EXTENSIBLE
CONTENT ADDRESSABLE MEMORY". In accordance with the invention, VxCAM
matches the minimum number of predetermined plurality of patterns
resulting in fewer memory elements so that the invention can be easily
implemented on-chip, narrows path width and reduces connection
establishment overhead.
[0003] The present invention relates to Internet communications in
general, and to a method and system in particular for substantially
increasing the data throughput of TCP/EP protocol based data
transmissions by selectively implementing in hardware certain portions of
the TCP/IP protocol set (such as a majority of actually called and
executed routines), and implementing in software routines the exceptions
and remaining portions.
[0004] Since the implementation of FDDI fiber network links, the
transmission speed of the physical layer to transmit data, has exceeded
the ability of the end node computers to process the data packets. If the
processing of the data packets is done by Von Neuman architectured end
node computers, capacity is always exceeded since the switching speed of
the fastest computer's gates will be approximately equal to that of the
physical layer comprising the internal components of Application Specific
Integrated Circuit (ASIC) chips. The computer CPU (which must process the
data packets with multiple operations and copies to memory) intrinsically
requires orders of magnitude more device operations than that of the
analog/state machine mediated physical layer of the ASIC chips normalized
to a common amount of data. While the problem of scaling current computer
networks to gigabit speeds has been recognized, the complexity of the
TCP/IP protocols has presented both practical and conceptual barriers to
attempts to implement them in any manner other than various forms of
software executed processes. However, even the fastest of CPUs for any
given technological generation, cannot match the physical bandwidth of
their internal components.
[0005] There have been a number of attempts to accelerate TCP/IP protocol
handling, but none has effectively solved the latency problems. One
approach to accelerate TCP/IP protocol handling was to process the
headers of the protocols independently of the data payload. While the
implementation of the protocols themselves was virtually identical to
existing methods (TCP/IP software stack), the data was indirectly
manipulated by separate buffering to avoid multiple copies of the payload
data through the use of hardware buffer management using a multi-port
memory. This approach demonstrated that hardware buffer management could
improve handling of large payload packets, but it did not reduce packet
latency to memory, did not improve the control bandwidth of the protocol
or the ability to send small packets efficiently, and did not decouple
protocol processing speed from transmission speed. The approach also was
not applicable to local clusters, or to small record applications like
web-serving or transaction processing. Moreover, the approach did not
eliminate the store/forward processing of protocols, but merely attempted
to optimize the methods by which the store and forward were mediated.
[0006] ATM cell-based transmission technology incurs a cost because of
segmentation and reassembly of large data payload messages into much
smaller cells. Devices which attempt to minimize this cost perform this
function at the signaling rate. However, this function is specific to
cell-based technologies, and is not particularly useful for technologies
such as Ethernet and HiPPI. The payload size of such technologies'
packets do not require an adaptation layer below that of the network or
IP (Internet Protocol) layer. In order to process TCP/IP protocols,
traditional store and forward methods must be used.
[0007] Protocol engines have also been used to optimize traditional
methods of protocol handling to reduce certain steps. These include
hardware checksum units, hardware buffer management, and RISC processing
to improve protocol handling rate. However, this approach still does not
scale with signaling rate.
[0008] Other approaches have implemented in hardware proprietary
non-TCP/IP protocols having a continuous flow and routing that is
specific to the particular network fabric. Variable context matching is
not performed, and cells propagate in strict format and order to a priori
known memory addresses instead of to a transport protocol's abstract port
destination. Therefore, such approaches are not readily adaptable to wide
area networks which must handle a variable and relatively unstructured
traffic flow, and which must be scaleable, expandable and readily
adaptable to network changes.
[0009] It is desirable to provide a network accelerator system and method
for handling standard TCP/IP protocol which solves the latency and other
problems of known systems and methods, and it is to these ends that the
present invention is directed.
SUMMARY OF THE INVENTION
[0010] The present invention provides a solution to the above-mentioned
protocol processing problems using a cross disciplinary combination of
hardware elements, techniques and results based, inter alia, on network
traffic analysis, high speed programmable logic array technology, and
integration with low level operating system software design.
[0011] The invention solves a problem that has been long unsolved of how
to process TCP/IP data packets at a speed equal to that made possible by
the latest generation physical layer hardware transmission components. As
microprocessors increase in speed, the same technology advances also
increase the speed at which data can be transmitted over networks. If
this data protocol handling must be handled in software, then there are
fundamental issues in logic and software design that will always make the
ability of a processor to process the packets slower than the physical
ability of the network to transmit packets. This speed differential can
penalize maximum possible network performance by a factor of almost one
hundred at present.
[0012] The main insights that enable the invention to provide a practical
and implementable solution to the above-mentioned protocol processing
problems are the recognition that the transmission patterns of the vast
majority of packets over current TCP/IP mediated networks are predictable
and involve only a very small subset of the entire TCP/IP protocol set.
It is possible through logic design to implement this small set of
actually used protocols in hardware, such as programmable logic gate
arrays, to allow processing of TCP/IP data packets at speeds equal to
that of the ability of the fastest physical network layer. The rare
packets that cannot be handled in this manner can be defaulted to
conventional software processing. An operating system also can be
low-level interfaced to this processing system through appropriate memory
management in such a way that the packet's data coming off the network
data transmission medium can be processed and put into application memory
at the speed equivalent to a single gate-mediated operation.
[0013] The invention allows practical processing of TCP/IP data packets in
gate array hardware at a data throughput equal to that of the physical
transmission media. It accomplishes this task by recognizing that TCP/IP
packets on current networks fall into predictable transmission patterns
that actually utilize only a small fraction of the entire protocol for
the vast majority of transmissions. By implementing this small subset in
gate array hardware and defaulting the exceptions into software, a very
large increase in TCP/IP packet throughput can be obtained.
[0014] TCP/IP transmissions handled by the invention can be made faster
than that possible with the best current software implementations and
multiprocessor TCP/IP processing engines. Using mask programmable logic
affords approaches which are both faster and less expensive to construct
than the current RISC CPU assisted TCP/IP processing boards, the
invention is intrinsically scaleable upwards in speed with little or no
redesign needed as advances in IC processing technology makes the network
physical layers faster. A form of software embedded in hardware which can
be physically implemented at any point where TCP/IP packet processing is
used such as in network interface cards, and within microprocessor CPUs,
affording significant potential technological and economic benefits.
[0015] A difference between the invention and prior approaches is that the
invention constructs a path into memory for a specific class of packets
that exists for the likely time interval when such a packet will be
present. The path into and out of memory is handled entirely in the
hardware of the invention with only random logic up to where it interacts
with the application, and is triggered entirely by the arrival of the
packet itself. In this hardware, all details are present for handling the
packet payload state to where it will be delivered. With accelerators on
both ends of a network transfer, no software overhead need be present for
bulk data transfer in burst mode. This differs markedly from prior
software and hardware approaches which employed techniques of minimized
protocol implementations, buffer management, or by spreading the protocol
implementation across a specially designed network fabric.
[0016] The invention implements continuous flow (streamed) information
delivery via a standard protocol such as (TCP/IP) by means of a pattern
match via associative memory. It has several benefits in processing
standard protocols, as opposed to non-standard protocols. These include
absolute minimum latency between application and network medium (fiber),
absolute maximum bandwidth between communicating network applications,
low complexity design network protocol processing mechanism, and the
protocol rate scales linearly with network signaling rate.
[0017] These and other benefits are obtained, in one aspect, by avoiding
software and hardware processing steps via an isochronous
"stimulus/response" architecture using a variable content addressable
memory that has preprogrammed state logic that effects protocol
processing as a minimum time series of operations. A substantial, e.g.,
ten-fold, improvement in interapplication bandwidth with same complexity
hardware results which makes practical low-cost gigabit network transport
communications. While standard protocol processing is not unique as a
process, this inventive method of processing is unique in that the
software of a protocol implementation processes protocol information
indirectly via hardware which has been a priori instructed on how to
handle a predicted flow of packets autonomously. This methodology is
superior to prior attempts in that the transmission speed of the network
transport layer is scaled with the network physical layer.
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] FIG. 1 is a diagram of a network accelerator in accordance with the
invention;
[0019] FIGS. 2a and 2b are diagrammatic views which contrast,
respectively, a traditional link data stream store and forward approach
with a network accelerated continuous flow link data stream approach in
accordance with the invention;
[0020] FIG. 3 is a more detailed diagram of the network accelerator of
FIG. 1;
[0021] FIG. 4 is a diagram of a control unit of the accelerator of FIG. 3;
[0022] FIG. 5 is a diagram of a transmit engine of the network acelerator;
[0023] FIG. 6 is a diagram of a receive engine of the invention;
[0024] FIGS. 7 and 8 are diagrams contrasting a traditional memory process
with that of the invention and
[0025] FIG. 9 is a diagram of the term addressable memory of the
invention.
DESCRIPTION OF PREFERRED EMBODIMENTS
[0026] The invention is particularly adaptable to TCP/IP protocols and
will be described in that context. It will be appreciated, however, that
the invention has greater utility and is applicable to other streamed
protocols.
[0027] Computer networks use network software protocols to communicate
information reliably between computers over multiple successive physical
signaling mediums. These protocols are implemented in software on
computer processors. While hardware signaling rates have steadily
increased, software protocol processing has not kept pace. With the
advent of gigabit networking technology, costly processors must be
dedicated to providing at most 40-50 percent of the theoretical bandwidth
of the network, while software implementations used with earlier
signaling technologies were capable of 60-80 percent of theoretical
bandwidth. Clearly bandwidth demands will continue to increase, and since
the disparity between software protocol processing and signaling rates
will also increase, this defines a "bottleneck" in the effectiveness of
networking technology.
[0028] The invention affords a minimum time mechanism for handling TCP
"burst" transfers. A burst transfer is a series of bulk data transfers
with no options between nodes (usually it is unidirectional). It consists
of the sending node passing data payload packets with successive sequence
numbers, and the receiver committing them to memory and sending
acknowledgments back to the sender to trigger more data to be sent. The
invention efficiently
handles the burst so as to minimize latency. The
software triggers the burst mode by a traditional send operation with a
full-payload, and the receiver/transmitter fall into an asynchronous
feedback loop of "send-next"/"acknowledgment" packets that continues
until either the burst completes or an error occurs.
[0029] The invention provides a mechanism to process the costly portions
of standard protocols in hardware entirely, and to do so at the same
clock rate of the signaling. In this way, as the signaling rate rises, so
does the protocol processing rate increase in lockstep. This approach is
based upon several observations, including traffic pattern analysis of
packets, experience with software protocol implementations, and
experience with other nonstandard hardware-implemented protocols.
[0030] Traffic observation of TCP/IP packets shows that the majority of
the packets simply pass bulk data without event, while the minority
packets require more elaborate handling. Even more significant is that
the delays on the Internet are for these very bulk data packets, so that
it is critical to have timely delivery or low-latency of these packets
for the performance to be maximized. Loss of this low-latency also
impacts the reliability of a network, since it becomes impossible to tell
if a failure has occurred, or if an assemblage of worst-case delays has
masked an otherwise successful transfer. The ability to handle protocol
packets with deterministic response time (as well as with a broad range
of arrival distributions) is a requirement to maintain the "real time"
characteristics that telecommunications services like telephone systems
use to provide high-valued services en masse globally.
[0031] Experiences with software protocol implementations have shown that
the necessary operations for a TCP/IP "burst" mode are constrained enough
to be performed by hardware as clocked by the data stream. Unfortunately
the difficulty in synchronizing the software with the data stream renders
this observation useless. However, significant performance advantage can
be gained by relying on hardware logic gate delays instead of program
instructions for substantially reducing the latency between the network
and application, thus allowing protocol handling at sustained rates
without the need for additional buffering. This allows for continuous
protocol processing at the data rate of the signaling technology. As will
be described, the network accelerator of the invention uses a
deterministic state machine to implement the transport protocol bulk
receive and transmit functions, leaving to the software all other
features of the protocol (including error recovery).
[0032] FIG. 1 is a functional diagram of a network protocol accelerator 10
in accordance with the invention. As will be described, the network
accelerator can perform 100 baseTx full-duplex interface to an Ethernet
network, with media access controller (MAC functions), IP (Internet
Protocol) processing and decoding, and TCP (Transmission Control
Protocol) processing. It may be a PCI interface board, designed to be
used in an NT workstation, for example, utilizing a standard PCI bus
slot. The accelerator will preferably have a physical link layer
processor, an IP processor, a TCP processor, segment buffer memory,
multiple FPGAs for logic, and a PCI interface to the host system.
[0033] As shown in the figure, the network accelerator includes a network
interface 12 that includes a physical (PHY) media framing unit which
obtains physical signals from the physical media, decodes the signals and
the link layer framing as a byte stream, and supplies the stream to an
accelerator engine 14. Simultaneously, a copy of the signals may be
recorded in a receive/transmit (Rx/Tx) FIFO bypass unit 16. In the event
of a failure of the accelerator engine 14 to accept the packet, a system
bus interface unit 18 may signal a system interface unit 20 to handle the
packet which is stored in the Rx FIFO bypass portion of 16. Similarly,
the system interface can send packets via the bus interface to the Rx/Tx
FIFO bypass portion of unit 16, which hands it on to the physical media
framing unit 12, effectively bypassing the accelerator engine for non-TCP
data transfers. The accelerator engine 14 is connected to a variable
content addressable memory 22, and consults the memory as octets of a
packet are received to find a match with predetermined patterns. When a
match is found, a state machine associated with the pattern is loaded
from the content addressable memory into the accelerator engine to
operate on the packet. Operations may include packet table delivery from
the physical media framing unit to a dual port application transfer
buffer 24; sending a packet with payload from the application transfer
buffer to the physical media forming unit; and sending a packet
acknowledgment to the physical media framing unit.
[0034] Upon completion of an operation, the accelerator engine 14 may
signal its status to the system interface unit 20 via the bus interface
18 indicating the event. In the event the accelerator engine fails to
recognize a packet or encounters an error, such as a CRC (cyclic
redundancy check) or checksum error, the accelerator engine suspends
operation on that segment causing all packet traffic to be handled via
the Rx/Tx FIFO bypass buffer until it is re-enabled by the system
interface unit 20 via the bus interface 18 to return to normal operation.
[0035] Preferably, the network accelerator of the invention is implemented
as a set of programmable logic chips intermediate between the network
physical layer interface chip set and an interface chip set for the PCI
or other bus. These programmable logic chips may be SRAM field
programmable gate arrays (FPGAs) which can be reprogrammed via the
network to allow the hardware protocols to be modified after installation
to correct errors, to optimize performance as the network changes, or to
implement changes to existing protocol sets. Low cost implementations can
also use mask programmable chip sets. The speed advantages of mask
programmable ASICs may make their use preferable in high speed point to
point data transfer applications where the end nodes and routers are well
defined by the user.
[0036] The network accelerator also may be implemented within the silicon
of the main microprocessor or within on-board multi-chip-modules
analogous to the way MMX incorporates digital signal processor
functionality, or by which AGP provides on-chip integration of graphics
accelerator functions. This can bring the data directly into the
microprocessor, bypassing the external data bus interface which otherwise
limits performance.
[0037] The high speed data transmission capability of the invention is
advantageous in providing for direct data storage, display, or data
processing device interconnect both between and within individual
computers.
[0038] FIGS. 2a and 2b are diagrammatic views which contrast the
significant improvement in latency between a network accelerated
continuous flow link data stream approach of the invention (FIG. 2b) and
a traditional link data stream store and forward approach (FIG. 2a). FIG.
2 illustrates how a traditional protocol stack accumulates data in a
store and forward buffer 30, and then performs the necessary protocol
processing operations. As indicated in the figure, data packets from an
Ethernet are delivered to a link data delivery unit which may perform
error checking prior to storing in buffer 30. The time required for this
operation is of the order of tens of microseconds. Subsequently, the
various segments of the data packet are processed in a protocol processor
32. This processing is sequential, and results may also be placed in the
store and forward buffer as application payload data is delivered to the
protocol processor. This typically may take hundreds of microseconds,
even with very high speed devices performing the operations. The critical
weakness is that data must be in some kind of buffer before it can be
processed, and processing must be completed before the data can be
forwarded.
[0039] In contrast, as shown in FIG. 2b, the network accelerator of the
invention uses the protocol's data stream itself as a way of instructing
a uniquely constructed data flow processing machine 34 that is clocked by
the protocol data and which performs processing operations as the
information appears. As indicated in the figure, processing occurs in a
series of parallel functional units 35-38 having a pipelined architecture
so that packets are processed in real time with the processed data
flowing between the network's wire link and the processing application's
data origin. In effect, a protocol's packet would appear as a single, fat
instruction that would run on a data flow processor in lock step with the
link's data rate. This allows complete processing in times of the order
of tens of microseconds in contrast to the traditional store and forward
approach illustrated in FIG. 2a. The manner in which this is accomplished
will be described in more detail below.
[0040] FIG. 3 is a functional diagram which illustrates in more detail a
preferred embodiment of the network accelerator of FIG. 1. As shown, the
physical media framing network interface 12 may comprise a physical
device interface (PHY) 40 connected to a CRC/MAC unit 42. This physical
device interface and CRC/MAC unit provide physical and link layer access,
respectively, to an Ethernet network. The CRC/MAC unit provides
parallel-to-serial conversion, CRC (cyclic redundancy check) generation
and checking, MAC address recognition, FIFO buffering, and interface to
the remainder of the network accelerator, which includes TCP/IP
processors and the dual port application transfer memory 24, which
preferably comprises a dual port/double banked RAM.
[0041] As will be described more fully, outgoing Ethernet packets will be
read from the buffer memory 24 and transferred to an internal FIFO in
preparation for transmission to the network. As the Ethernet packet is
constructed and output to the network from the FIFO by the Tx engine, the
CRC will be calculated on the fly and appended to the end of the Ethernet
packet. An incoming Ethernet packet is stored in the incoming FIFO while
the destination address is checked against the MAC address register. If
the MAC address is correct, the Ethernet packet is sent to an Rx engine.
The Ethernet packet is also run through the CRC checker, simultaneously.
Once the Ethernet packet is completely received and the CRC is good, the
CRC good signal will be asserted.
[0042] As shown in FIG. 3, the accelerator engine 14 includes a control
unit 44, a Tx engine 46, and a Rx engine 48. Included within the variable
content addressable memory 22 is a first prototype memory 50 connected to
the Tx engine 46, and a second prototype memory 52 connected to the Rx
engine 48. In addition, variable content addressable memory (VxCAM) 22
includes a content addressable memory (CAM) 54 which is also connected to
the Rx engine 48. The variable content addressable memory matches a
variety of packet formats and is used to quickly determine to which
session an incoming IP datagram belongs. The variable content addressable
memory 22 also includes an ADE memory 56 which is connected to control
unit 44. The Rx/Tx FIFO bypass memory 16 may be implemented as a Tx
bypass memory 60 and a Rx bypass memory 62. As shown, bypass memory 62
may be connected to the Rx engine 48 and to a bus 64 connecting the
control unit 44 and bus interface unit 18. The Tx bypass memory 60 may be
similarly connected to bus 64 and to the CRC/MAC unit 42.
[0043] The network accelerator
handles the various layers of an Ethernet
packet as it is sent or received from a network. When processing the IP
layer, the IP address, IP checksums, ID field, flags, IP datagram length,
etc. are either pre-calculated and sent to the network via the MTx engine
46, or used to verify the destination of an incoming Ethernet packet via
the Rx engine 48.
[0044] When processing the TCP layer, TCP ports, TCP checksums, sequence
numbers, ACK number, flags, window size, urgent pointer, options, etc.
are either pre-calculated and sent to the network via the Tx engine, or
used to verify the destination of a incoming IP datagram via the Rx
engine. The Tx engine 46 obtains the TCP payload directly from the memory
24. The Rx engine 48 delivers the TCP payload directly to the memory 24.
[0045] The memory 24 will contain the host system view of network memory,
and a shadowed copy for the network accelerator to use for TCP segment
transmission and reception. The host system software driver will swap
application memory (system RAM) for memory 24. This will allow the host
system direct access the network data stored in the dual-port/double
banked memory, effectively replacing the role of host system RAM.
Finally, the system interface controls the relationship between the
system and the network accelerator. It contains configuration and status
registers, and allows the host system to access the network accelerator.
[0046] Data for the packets are buffered for transfer using the memory 24.
This memory maintains an up-to-date copy of the network data for the host
system/application, and a local copy of the network data for the network
accelerator. This allows the application/host system to access memory as
it would system RAM before, during and after a TCP segment is sent to the
network by the network accelerator. Also, the memory allows access to a
stable copy of the network data for transmission or reception to/from the
network. The network acceleration control unit maintains the proper
relationship between the memory banks, with the banks synchronized in the
case of Idle state (the network accelerator is neither transmitting nor
receiving TCP segments), or logically separated during network
accelerator TCP segment transmission or reception. The double banked
nature of the memory allows a "zero-copy" or "zero-latency" method of
network data delivery to the network accelerator.
[0047] Along with the bulk memory, there is status memory used to maintain
the relationship between the memory bank and the host system memory bank.
This status memory works as a table indicating which bank of memory has
the most current byte of network data for each address in the memory.
[0048] The content addressable memory (CAM) 54 is used to quickly
determine to which session an incoming IP datagram belongs. It cooperates
with ADE memory 56, prototype memories 50, 52, and is part of the
variable content addressable memory (VxCAM) 22.
[0049] Within the ADE memory 56, there will be one or more address
descriptor entries (ADEs) which describe the segment details such as
memory base address, TCP payload length, TCP payload checksum, next TCP
sequence number and the next TCP segment's ADE. This information is used
by the Tx engine when the segment is constructed, prior to transmission.
The Rx engine uses the ADE fields to determine the sequence numbers,
payload destination, and out-of order segments.
[0050] Within the prototype memories 50, 52, there will be one or more
session prototype description entries. These entries describe the session
fields that do not change, as well as the initial values for the session,
such as IP address, TCP ports, protocol fields, base sequence number,
first ADE, etc. The Tx engine uses this information to generate the
static fields within a session for an outgoing TCP segment. The Rx engine
uses this information to determine what TCP session an incoming TCP
segment is destined for, and to verify the validity of specific fields in
the TCP/IP header.
[0051] The content addressable memory 22 stores the address of the
potential TCP session prototype entry that describes the session to which
an incoming segment belongs. Certain fields in the TCP/IP header are
hashed to obtain a value which is used as an address to "look up" which
prototype describes this segment. The memory stores at the "hashed"
address another address which points to the prototype data in the
prototype memory. If the memory returns a value of zero, the incoming TCP
segment does not belong to any accelerated sessions and is routed to the
bypass FIFO. In this manner, a "one s
hot" lookup of the TCP session
prototype can be done, rather than searching potentially thousands of TCP
session prototypes.
[0052] FIG. 4 illustrates the network accelerator control unit 44 in more
detail. As indicated above, the control unit provides the overall state
machines and control registers which control the network accelerator.
Logic for controlling the dual port application transfer memory 24 and
the Tx and Rx session state machines (to be described) for the Tx engine
and the Rx engine, respectively, may be contained in a dual port memory
controller 61. Logic for generating a checksum may be contained in a
checksum unit 62 which interfaces with ADE memory 56 via an address bus
57 and a data bus 58. After initialization of a current checksum, the ADE
memory 56 may be created and used for bounds checking on the host address
to obtain the checksum for the desired payload of a TCP segment. This
checksum may be loaded into the checksum unit 62. The current value may
be stored in memory 24, and a new calculated value may be added to the
checksum. The checksum may then be saved either through a write back to
the ADE memory 56 and the dual port application transfer memory 24, or if
multiple locations require modifications, by iteration. The calculated
checksum is then ready for use as the data checksum of the TCP segment.
[0053] As shown in FIG. 4, a first FIFO buffer 70 may interface the dual
port application transfer memory 24 to Tx data from the Tx engine state
machine, and a second FIFO buffer 72 may interface memory 24 to Rx data
from the Rx engine state machine. Logic for controlling the FIFOs may be
contained in the FIFO buffers themselves and used to minimize bus
arbitration read/write by an arbitration unit 74. In addition, logic for
controlling the Rx engine and the Tx engine, as well as access to their
status and control registers, may be contained in a registers and
configuration unit 76 which is interfaced to memory 24 and memory
controller 61 by an application data bus 77 and an application address
bus 78. Arbitration unit 74 also may include logic to control memory
access arbitration between the host system and the network accelerator.
The network accelerator control unit also maintains global control of the
state machines for each session.
[0054] The Tx state machine and the Rx state machine may have the
following states. Tx idle is the state prior to sending a Tx buffer to
the network. This is the default state and is set up by the software
driver. The software driver will also generate the necessary values for
the VxCAM for a given buffer space. The host system fills the memory
until the session is ready to be transmitted. At this point, the state
machine transitions to the Tx pending state.
[0055] During the Tx pending state, the dual port memory controller 61
maintains two copies of the data: one for the host system, and one for
the Tx engine. Proper data relationship between the host system memory
and the network accelerator memory must be maintained to prevent old data
from overwriting new host system data, and new host system data from
overwriting the data in use by the Tx engine.
[0056] In the Tx complete state, if the transmission fails the state
machine goes to the Tx re-transmit. If the Tx was a success, the network
accelerator will set a success bit and go to the Tx idle state. The
network accelerator is now waiting to send out the next segment. In
either case, the network accelerator control unit must continue to
maintain the proper relationship between the two copies of data.
[0057] If the Tx transaction fails, in the Tx re-transmit state, the
network accelerator may either attempt to re-transmit the segment, or
move to the next session queued for transmission and attempt this segment
later.
[0058] The Rx idle state is the initial state. In this state, the two
copies of network data are reconciled. Depending on the outcome of the
previous received segments, the host system reads data from either the
shadow bank of dual port application transfer memory or the application
bank of same memory. If a packet was successfully received, the net
payload data stored in the shadow bank of the dual port application
transfer memory must be presented to the host system. This is performed
on a byte by byte level.
[0059] In the Rx pending state, the Rx engine is receiving one or more
segments is the current session. Receive data is placed in the proper
bank of the memory by the network accelerator control unit.
[0060] In the Rx complete state, there may be two different scenarios: Rx
success or Rx time-out. In the case of success, the success bit for the
ADE is set, then the Rx idle state is entered. In the case of failure,
the state machine goes to idle and no changes occur to the memory.
[0061] Checksums for the payload of the TCP packet are calculated by the
checksum logic 62 as follows. Upon initial setup of the session, the
section of memory used by the session is cleared to all zeros. This
allows the initial checksum to be initialized to zero for each segment.
ADEs are setup for each segment within the session; ADEs contain the
starting address, ending address, and checksum for each segment of the
session. There may be one or more segments in any session.
[0062] During host system writes, the host system presents an address to
be accessed. Bounds checking is performed on this address to determine
which ADE contains the checksum for this address. The checksum is loaded
into the checksum logic and the current (old) value in the memory is
subtracted from the checksum. Next, the new data value is added to the
checksum.
[0063] Upon saving a new checksum, if it is a single location write, the
new checksum is written back into the ADE and the new data value is
written into the memory. If multiple locations are to be modified, the
checksum stays in the checksum generator and each new data value is added
to the checksum while each old value is subtracted. The new data is
written into the dual port application transfer memory 24 during this
operation.
[0064] Once a segment is ready to be sent to the network, the Tx engine
uses the checksum stored in the ADE as the checksum for the data portion
of the TCP segment.
[0065] The Rx engine and the Tx engine use FIFOs 70, 72 for interfacing
the engines to the dual port application transfer memory 24. The FIFOs
minimize the bus arbitration necessary to read and write data into the
dual port application transfer memory 24 from the engines. The control of
the FIFOs involves filling and draining the FIFOs in a cycle-steal mode
between host system accesses to the memory.
[0066] The control unit 44 has an address and data bus connection to the
Tx and Rx engines 46, 48. This bus allows the control unit to set and
read configuration and status registers within the two engines.
[0067] The control unit controls access to the Tx and Rx engines and all
memory 24, 50, 52, 56, and CAM memory 54 through arbitration, using
arbitration unit 74. Host system accesses and accesses compete for access
through the control unit. Any known arbitration method may be used to
control these accesses.
[0068] FIGS. 5 and 6, respectively, illustrate in more detail preferred
embodiments of the Tx (transmit) engine 46 and Rx (receive) engine 48.
Referring to FIG. 5, Tx engine 46 may be controlled by a state machine
100, which is used to generate signals which are used to control all the
events in the send process. It may be based on a send counter (not
shown). This counter is started at initial transmission time, and
generates signals which are used to control all the events in the send
process. A multiplexer 102 combines Tx data and the outputs of several
registers, and provides these to an output register 104.
[0069] The registers muxed to the output register 104 may be Tx prototype
register 106, a Tx application data output register 108, the outputs of
checksum registers 110 and 112, an ACK register, and all the individually
calculated fields in overlay registers 114 and 116.
[0070] The send counter and the Tx engine control state machine 100 govern
the timeslots for outputting the various fields to the output register.
The state machine determines the proper time to calculate the various IP
and TCP fields and when to send next segment. Sliding window calculation
logic provides the information via register 120 to the Tx engine state
machine for next segment transmission.
[0071] The Tx engine is responsible for sending Ethernet packets
containing IP datagrams of TCP segments to the network. There are two
primary types of TCP segments. These are user data (ADE) segments, and
automatically generated acknowledgment segments for received data. The
network accelerator creates packets from scratch, generates the Ethernet
header, the IP header, the TCP header, and the TCP data payload.
[0072] The Tx engine state machine 100, which may be contained in the dual
port memory controller 61, asserts the Tx pending state through the Tx
engine control state machine 100, making available data contained in the
Tx FIFO 70 (FIG. 4). The Tx engine loads a prototype register 106 with
static portions of the TCP/IP headers from the proto memory 50 of the Tx
engine (FIG. 3). The logic for the calculation of the dynamic portion of
TCP header is contained in a TCP header Tx overlay register 116. The
logic for the final checksum calculation for the dynamic portion of the
IP header may be contained in an IP header checksum register 10 of FIG.
5, and the logic for the post-checksum calculation for the TCP segment
may be contained in a TCP segment post-checksum register 112. The logic
which provides sequential accesses to the register contents to the output
register 104 may be contained in a transfer register 105. The logic for
calculating sequence numbers from the TCP header Tx overlay register by
adding the length of the packet data contents to a current sequence
number may be contained in an arithmetic logic unit (ALU) 1 18. The logic
to obtain values of a prior received datagram's sequence number and
length to generate an acknowledgment number using the ALU may be obtained
from the Rx engine TCP header Rx register 103. The final results may be
output via the output register 104 to the CRC/MAC unit 42 of FIG. 3. The
logic for determining whether sending of a datagram was successful and
acknowledged may be contained in the engine control state machine 100.
The logic for determining if one can send additional datagrams is
determined by the engine control state machine 100 and Tx window register
120.
[0073] The data used to generate the dynamic calculated portion of the
TCP/IP headers reside in the ADE memory, the proto memory, and the memory
24, and the data to generate the static precalculated portion of the
TCP/IP headers resides in the Tx engine proto memory 50. The data used to
generate the TCP/IP payload resides in the dual port application memory
24.
[0074] When the host system asserts a signal indicating that a segment
should be sent to the network, the base address of the segment prototype
is loaded into the Tx engine proto memory address register, and the base
ADE address for the segment is loaded into the Tx engine ADE memory
address register. The Tx engine reads the ADE and prototype data out of
the ADE memory and the Tx engine proto memory, respectively, then
calculates the various fields and inserts the fields into the outgoing
network stream. Certain fields of the stream, such as sequence numbers,
ACK numbers, ID fields, etc., may be calculated as the stream progresses.
Once the headers have been calculated, the TCP payload is output from the
dual port application transfer memory. Finally, the CRC/MAC unit 42
appends a CRC 32 value to the Ethernet packet, and completes delivery of
the packet to the PHY device 40. In this manner, the network accelerator
generates a complete Ethernet packet comprising an IP datagram containing
a TCP segment.
[0075] Referring to FIG. 6, the Rx engine 48 is controlled by an Rx engine
control state machine 140, and receives Ethernet packets from the network
interface comprising the PHY device 40 and the CRC/MAC unit 42 via the
input register 142. Upon receipt, the state machine sequences data to
other elements of the Rx engine. The receive packet is sent to the Rx
bypass memory 62, which serves as a buffer and used for any packet that
is not a bulk data transfer TCP segment. The Rx engine processes the IP
and TCP headers and determines the type of TCP segment. The Vx CAM memory
22 is used by the Rx engine to determine to which session an incoming IP
datagram belongs.
[0076] The Rx engine, under the control of state machine 140, compares a
number of fields of the IP header with expected values stored in a
plurality of registers. Certain fields in the TCP/IP header are static
and can be compared against static values. Other fields are variable, and
define, for example, the length, or checksum or other session-related
details. The variable fields are compared against values stored in
registers and pre-determined values stored in the ADE memory.
[0077] Upon receiving an incoming packet, the header is decoded by a
decoder 144 to determine the location of the source and destination
addresses and ports contained in the TCP/IP header. The logic for
locating the associated prototype packet header and address descripted
entry is contained a Vx CAM proto-ADE locator 146. The Rx engine block
address decoded entry may be held in an ADE register 156. The Rx engine
block prototype entry may be held in proto memory 52, which loads the
entry into a prototype register 148. A TCP/IP header matcher 150 which
contains logic for comparing of session fields of the packet obtained
from the prototype register and variable fields held in a TCP header Rx
register 152 and IP header Rx register 154. Logic for validating the
checksum for the IP portion of the TCP/IP header matcher 150 may be
contained in an IP header checksum unit 162, and the logic which
validates the checksum for the TCP segment portion of the TCP/IP header
matcher and the data stream from the input register 142 may be contained
in a TCP segment header checksum unit 160. Data from valid packets may be
passed to the receive data FIFO 72 (FIG. 4). Logic for updating TCP
header Rx register 152 for transmitted data acknowledgments or buffer
window size adjustment may be contained in the arithmetic logic unit
(ALU) 164.
[0078] The Rx engine control state machine 140 reduces the Rx window
register 170 as data is received, and increases it as buffer space
becomes available in the dual port application transfer memory 24 (FIG.
3) by the application. Under the control of the state machine, the
contents of the Rx window register 170 may also be passed to the ALU 164
to synthesize a window update, which may be passed to the Tx engine via
the Rx engine transfer unit 172.
[0079] When processing a packet header, if any of the fields of the header
do not match expected values, the segment may be routed to the Rx bypass
memory 62, and the Rx engine may go into an idle state. The IP source and
destination addresses, plus the TCP source and destination ports, may be
hashed together to form a value which is used as an address to look up in
the content addressable memory the address for the Rx prototype. If the
memory returns a non-zero value, it is used as an address to fetch the Rx
prototype. If the value is zero, the packet is routed to the bypass
buffer.
[0080] The value returned by the content addressable memory is used as the
base address for the Rx prototype for the segment. The prototype is read
and the IP address and the TCP ports are compared against prototype
values. If they match, the segment is accepted for further processing,
and the ADE base address is read from the prototype memory array. The ADE
contains the base sequence number of the memory region. If the sequence
number and the segment falls within those in the ADE, it is accepted and
the base TCP payload address is read from the ADE.
[0081] Data from the segment is read into the dual transport application
memory 24 until the segment is completely received, which can be
determined by a length counter. Once a segment is received, a CRC 32
signal may be asserted, indicating the packet has been verified and to
notify the host system of receipt of data. The Rx engine 48 remains in a
pending state until a finished bit is received for the segment. At that
time, the system is interrupted and the network accelerator control unit
goes into the Rx complete state.
[0082] From the foregoing, it may be seen that the network accelerator of
the invention affords significant advantages, and may be used in diverse
applications. It is also applicable to continuous flow, streamed
protocols other than TCP/IP. Some of these applications include high
speed links for network backbones, protocol processing for gigabit
physical Ethernet layers, data transport between computers within a
system, high speed transport for real time high resolution video,
increasing the speed of Internet data burst communications, permitting
telephony packets to be transmitted over the Internet, and affording
enhanced transaction processing and robotics control feedback.
[0083] As will also be appreciated from the foregoing, the implementation
of the network accelerator is not limited to FPGAs. It may be implemented
in other forms of hardware, and even integrated with microprocessors. The
invention may be installed in various components, such as disk drives,
graphics cards, video transmission devices, wireless links, TCP/IP hubs,
and the like. The substantial increase in speed and corresponding
reduction in latency afforded by the invention is a significant
advantage.
Overview
[0084] Internet communications require heavy use of packet traffic that is
directed between endpoints by large (32 to 97 bit) identifiers within the
packet. packet traffic is directed by partially decoding the full 97 bit
identifier. The smallest identifier (32-bit) direct raffic to a computer,
medium sized identifiers direct traffic to a specific computer's
application program instantiation (65-bit), while the largest identifier
(97-bit) identifies the communication session between two application
programs. Level-3 switches use he smallest identifier, but for level-4
switches and processor adapters, medium and arge identifiers are required
hundreds to thousands of these identifiers may be used in a fraction of a
second. If more identifiers can be matched more quickly, the same ardware
can handle more network bandwidth for improvement in performance.
[0085] Content-addressable memory (CAM) is a hardware concept commonly
used in switches to direct packets. The VxCAM, or Virtual context
dependant content addressable memory is an improvement of the CAM concept
that takes into account characteristics of the usage on the Internet to
perform more effectively. The VxCAM outperforms a CAM in a Level-4 device
by requiring (a) fewer memory elements to switch the same amount of
traffic, (b) less wide data paths, and (c) shorter connection
establishment--all the result of the fewer terms to check and setup.
[0086] This invention reduces the complexity of the process of associating
packets with specific information sessions or groups as explained above
such that higher-level protocol functions can be performed In accordance
with the invention, since the primary protocol used to communication over
the Internet is TCP, over 95% of all communications is TCP. A single web
page on the average has 10 connection sessions that each send on the
average 10K bytes of payload for a total of 100-300 packets. The 97 bit
endpoint to endpoint Internet identifier can be broken into separate
components as follows:
[0087] (1.) 1 bit for UDP/TCP protocol selection, (2) two 32 bit IP source
and destination t terms that determine selection of the communicating
computers, and (3) two 16 bit source and destination UDP/TCP port terms
for application/session determination within each computer.
[0088] I have discovered that that 90% of the terms in all of the packet
identifiers in all of these packets are the same. Based on that
discovery, the technique of term sharing has surprising application
wherein one memory cell for the redundant terms that would in a
conventional CAM consume memory cells, has surprising efficiency.
[0089] In the invention, two kinds of identifiers are present--IP
addresses and ports. Small "term" CAM's of IP addresses and ports match
terms regardless of use as source or destination, or of another session.
Another benefit is to "compress" an address or port into fewer bits,
since the index of each of the term CAM's is smaller than the term width.
Furthermore in the invention, the combination of source/destination
address/port matches is matched against yet another small CAM of sessions
to in turn locate the index of the session descriptor. As a result, these
three small CAMs reduce a 97 bit by 1024 session CAM of 99,328 bits to a
fully allocated VxCAM of 61,440 bits.
[0090] Limits of the process in accordance with the invention: at least
one port per session is required as is at least one address per sets of
shared sessions. Hence, in average use the size of the VxCAM of 1024
sessions is greatly reduced--from 61,440 to 23,552 bits. As a result,
small memories can be implemented on chip instead of requiring larger off
chip memories with their attendant drawbacks, viz., delays, costs, etc.
DESCRIPTION OF EMBODIMENTS
[0091] As previously mentioned, content addressable memory (CAM) is a
hardware method in which incoming data is compared with a set of
predetermined patterns to identify a matching pattern, and has been
heretofore used by Internet communications devices to partially decode
large (32 to 97 bit) identifiers to direct the packet to its destination.
As As the number of sessions increases, the CAM required to compare and
match patterns increases linearly (see FIG. 7). Where off-chip memory is
used to process these patterns, inefficiency results.
[0092] VxCAM (Virtual Extensible Content Addressable Memory) of the
invention, matches a minimum number of predetermined plurality of
patterns, resulting in fewer memory elements (FIG. 8) required so that
the invention is easily implemented on-chip, narrows memory path width,
and reduces connection establishment overhead.
[0093] The method and system of the invention is shown in detail in FIG.
9.
[0094] As shown, as the destination address is latched in the TCP/IP
header register (200), the Session Accumulator (206) is cleared. The
destination address is gated onto the IP address bus (201) to an Address
Term CAM (203) which locates the destination address term. If not found,
the packet is signaled as not recognized and the VxCAM ignores all
further action. However, if found, the resultant index of the IP address
term is passed through the Adder/Mux (205) to the Session Accumulator
register (206). Similarly, the source address is gated onto the IP
address bus to an Address Term CAM (203) which locates the source address
term. If not found, the packet is signaled as not recognized and the
VXCAM ignores all further action. If found, the resultant index of the IP
address term is accumulated using the Adder/Mux (205) to the Session
Accumulator register (206). The destination port address is gated onto
the TCP/UDP port bus (202) to a Address Term CAM (204) which locates the
destination port address term. If not found, the packet is signaled as
not recognized and the VxCAM ignores all further action. If found, the
resultant index of the port destination address term is accumulated using
the Adder/Mux (205) to the Session Accumulator register (206). The source
port address, if present, is gated onto the TCP/UDP port bus (202) to an
Address Term CAM (204) which locales the source port address term. If
found, the resultant index of the port address term is accumulated using
the Adder/Mux (205) to the Session Accumulator register (206). The
contents of the Session Accumulator consisting of the term index of the
IP destination address, the term index of the IP source address, the term
index of the TCP/UDP destination port address, and the term index of the
TCP/UDP source port address if present, is passed to the Session CAM
(207) which locates the index of the session descriptor.
[0095] I have discovered that TCP/IP communications involve highly
redundant header fields that can be matched more efficiently by factoring
out the redundant entries.
[0096] While the foregoing description has been with reference to
particular embodiments, it will be appreciated by those skilled in the
art that changes in these embodiments made be made without departing from
the spirit of the invention, the scipe of which is defined in the
appended claims.
* * * * *