United States Patent Application 20020099855
Kind Code A1
BASS, BRIAN MITCHELL; et al.
July 25, 2002
NETWORK PROCESSOR, MEMORY ORGANIZATION AND METHODS
Abstract
A network switch apparatus, components for such an apparatus, and methods
of operating such an apparatus in which data flow handling and
flexibility is enhanced by the cooperation of a plurality of memory
elements and a plurality of interface processors formed on a
semiconductor substrate. The memory elements and interface processors
together form a network processor capable of cooperating with other
elements in executing instructions directing the flow of data in a
network. Access to the memory elements is controlled in a particular
manner and under operative rules which provide controlled multiple
accesses of the plurality of memory elements by the plurality of
processors.
Inventors: BASS, BRIAN MITCHELL (APEX, NC); HEDDES, MARCO C. (RALEIGH, NC);
PATEL, PIYUSH CHUNILAL (CARY, NC); REVILLA, JUAN GUILLERMO (AUSTIN, TX);
SIEGEL, MICHAEL STEVEN (RALEIGH, NC); VERPLANKEN, FABRICE JEAN (CARY, NC)
Correspondence Address: IBM CORPORATION 972/B656, INTELLECTUAL PROPERTY LAW,
P O BOX 12195, RES TRI PARK, NC 27709
Serial No.: 384744
Series Code: 09
Filed: August 27, 1999
Current U.S. Class: 709/249; 370/351
Class at Publication: 709/249; 370/351
International Class: G06F 015/16; H04L 012/28
Claims
What is claimed is
1. Apparatus comprising: a semiconductor substrate; a plurality of
interface processors formed on said substrate, the number of said
processors being at least five; internal instruction memory formed on
said substrate and storing instructions accessibly to said interface
processors; internal data memory formed on said substrate and storing
data passing through said device accessibly to said interface processors;
said internal data memory having a plurality of memory elements, the
number of said memory elements being correlated to the number of said
interface processors; a request controller operatively interposed between
each of said interface processors and said memory elements and
controlling the flow of requests for data from said interface processor
to said memory elements; a memory arbiter operatively interposed between
each of said memory elements and said interface processors and
controlling access to said memory element by said interface processors;
each of said interface processors having access to a plurality of said
memory elements and each of said memory elements being responsive to
requests from a plurality of said interface processors; and a plurality
of input/output ports formed on said substrate; at least one of said
input/output ports connecting said internal data memory with external
data memory; and at least two other of said input/output ports exchanging
data passing through the interface device with an external network under
the direction of said interface processors.
2. Apparatus according to claim 1 wherein each of said memory arbiters
grants highest priority memory access to a read request from a single
interface processor.
3. Apparatus according to claim 2 wherein each of said memory arbiters
grants second highest priority memory access to a plurality of
simultaneous read requests for common data from a plurality of interface
processors.
4. Apparatus according to claim 3 wherein each of said memory arbiters
grants third highest priority memory access to a plurality of
simultaneous write requests from a plurality of interface processors.
5. Apparatus according to claim 4 wherein each of said memory arbiters
grants fourth highest priority memory access to a plurality of
simultaneous read/modify/write requests from a plurality of interface
processors.
6. Apparatus comprising: a semiconductor substrate; at least five
interface processors formed on said substrate; internal instruction
memory formed on said substrate and storing instructions accessibly to
each of said interface processors; internal data memory formed on said
substrate and storing data passing through said device accessibly to each
of said interface processors; said internal data memory having at least
five memory elements; a request controller operatively interposed between
each of said interface processors and said memory elements and
controlling the flow of requests for data from the corresponding one of
said interface processors to said memory elements; a memory arbiter
operatively interposed between each of said memory elements and said
interface processors and controlling access to the corresponding one of
said memory elements by said interface processors; each of said interface
processors having access to a plurality of said memory elements and each
of said memory elements being responsive to requests from a plurality of
said interface processors; said request controllers and said memory
arbiters cooperating for implementing a set of access rules governing
access to said memory elements by said interface processors which grant
highest priority access to a single request from a single interface
processor to a single memory element; and a plurality of input/output
ports formed on said substrate; at least one of said input/output ports
connecting said internal data memory with external data memory; and at
least two other of said input/output ports exchanging data passing
through the interface device with an external network under the direction
of said interface processors.
7. A memory system comprising: a plurality of storage modules; and a
memory controller operatively coupled to the plurality of storage
modules, said memory controller being responsive to multiple requests
from multiple sources to provide access simultaneously to all or some of
the storage modules.
8. The memory system as set forth in claim 7 wherein multiple requests are
received simultaneously.
9. The memory system of claims 7 or 8 further including a plurality of
processors operatively coupled to the memory controller and generating
the multiple requests.
10. The memory system of claim 9 wherein the memory controller includes: a
plurality of arbiters with each arbiter operatively coupled to one
storage module and controlling access to said one storage module; a
plurality of control devices with each one operatively coupled to receive
signals from one of the processors and to generate storage request
signals indicating access to one of the storage modules; and a bus
structure operatively coupling the control devices and the arbiters so
that each control device has access to all or some of the storage
modules.
11. The memory system of claim 10 further including a substrate upon which
recited elements are integrated.
12. The memory system of claim 11 wherein the memory modules are of
different types with each one being designed to store a particular type
of data structure.
13. The memory system of claim 12 wherein the memory modules include DRAMs
and SRAMs.
14. The memory system of claim 12 wherein a portion of said SRAMs and at
least one DRAM are fabricated on said substrate and at least one SRAM and
the remainder of any DRAMs are located externally of said substrate.
15. The memory system of claim 12 wherein selected ones of the memories
are partitioned into multiple sub-memories which can be accessed
simultaneously.
16. The memory system of claim 13 wherein the DRAMs are partitioned into
four banks.
17. A method comprising the steps of: receiving a data flow inbound
through an input port of an interface device; communicating the data flow
through a plurality of interface processors embedded in the interface
device; parsing the data flow into a plurality of portions; storing the
parsed portions of the data flow in a plurality of memory elements;
controlling access between the memory elements and the interface
processors to allow each of the interface processors access to each of
the memory elements; and assembling and directing data flow outbound
through an output port of the interface device in accordance with the
execution of stored instructions by the interface processors.
18. A method according to claim 17 further comprising storing selected
portions of the parsed data flow in the memory elements, and directing
other selected portions of the parsed data flow to a switching fabric for
determination of an outbound direction.
19. A method according to claim 18 further comprising recombining the
stored and other selected portions of the data flow prior to direction of
the data flow outbound through an output port.
Description
RELATED APPLICATIONS
[0001] The interested reader is referred, for assistance in understanding
the inventions here described, to the following prior disclosures which
are relevant to the description which follows and each of which is hereby
incorporated by reference into this description as fully as if here
repeated in full:
[0002] U.S. Pat. No. 5,008,878 issued Apr. 16, 1991 for High Speed Modular
Switching Apparatus for Circuit and Packet Switched Traffic;
[0003] U.S. Pat. No. 5,724,348 issued Mar. 3, 1998 for Efficient
Hardware/Software Interface for a Data Switch;
[0004] U.S. Pat. No. 5,787,430 issued Jul. 28, 1998 for Variable Length
Data Sequence Back Tracking and Tree Structure;
[0005] U.S. patent application Ser. No. 09/312,148 filed May 14, 1999, and
entitled "System Method and Computer Program for Filtering Using Tree
Structure"; and
[0006] U.S. patent application Ser. No. 09/330,968 filed Jun. 11, 1999
and entitled "High Speed Parallel/Serial Link for Data Communication".
BACKGROUND OF THE INVENTION
[0007] This invention relates to communication network apparatus such as
is used to link together information handling systems or computers of
various types and capabilities and to components of such apparatus. In
particular, this invention relates to scalable switch apparatus and
components useful in assembling such apparatus. This invention relates to
an improved and multi-functional interface device and the combination of
that device with other elements to provide a media speed network switch.
The invention also relates to methods of operating such apparatus which
improve the data flow handling capability of network switches.
[0008] The description which follows presupposes knowledge of network data
communications and switches and routers as used in such communications
networks. In particular, the description presupposes familiarity with the
ISO model of network architecture which divides network operation into
layers. A typical architecture based upon the ISO model extends from
Layer 1 (also sometimes identified as "L1") being the physical pathway or
media through which signals are passed upwards through Layers 2, 3, 4 and
so forth to Layer 7, the last mentioned being the layer of applications
programming running on a computer system linked to the network. In this
document, mention of L1, L2 and so forth is intended to refer to the
corresponding layer of a network architecture. The disclosure also
presupposes a fundamental understanding of bit strings known as packets
and frames in such network communication.
[0009] In today's networked world, bandwidth is a critical resource.
Increasing network traffic, driven by the Internet and other emerging
applications, is straining the capacity of network infrastructures. To
keep pace, organizations are looking for better technologies and
methodologies to support and manage traffic growth and the convergence of
voice with data.
[0010] Today's dramatic increase in network traffic can be attributed to
the popularity of the Internet, a growing need for remote access to
information, and emerging applications. The Internet alone, with its
explosive growth in e-commerce, has placed a sometimes insupportable load
on network backbones. It is also the single most important cause of
increased data traffic volumes that exceed voice traffic for the first
time. The growing demands of remote access applications, including
e-mail, database access, and file transfer, are further straining
networks.
[0011] The convergence of voice and data will play a large role in
defining tomorrow's network environment. Currently, the transmission of
data over Internet protocol (IP) networks is free. Because voice
communications will naturally follow the path of lowest cost, voice will
inevitably converge with data. Technologies such as Voice over IP (VoIP),
Voice over ATM (VoATM), and Voice over Frame Relay (VoFR) are
cost-effective alternatives in this changing market. However, to make
migration to these technologies possible, the industry has to ensure
quality of service (QoS) for voice and determine how to charge for voice
transfer over data lines. The Telecommunications Deregulation Act of 1996
further complicates this environment. This legislation will reinforce a
symbiotic relationship between the voice protocol of choice, ATM, and the
data protocol of choice, IP.
[0012] Integrating legacy systems is also a crucial concern for
organizations as new products and capabilities become available. To
preserve their investments in existing equipment and software,
organizations demand solutions that allow them to migrate to new
technologies without disrupting their current operations.
[0013] Eliminating network bottlenecks continues to be a top priority for
service providers. Routers are often the source of these bottlenecks.
However, network congestion in general is often misdiagnosed as a
bandwidth problem and is addressed by seeking higher-bandwidth solutions.
Today, manufacturers are recognizing this difficulty. They are turning to
network processor technologies to manage bandwidth resources more
efficiently and to provide the advanced data services, at wire speed,
that are commonly found in routers and network application servers. These
services include load balancing, QoS, gateways, fire walls, security, and
web caching.
[0014] For remote access applications, performance, bandwidth-on-demand,
security, and authentication rank as top priorities. The demand for
integration of QoS and CoS, integrated voice handling, and more
sophisticated security solutions will also shape the designs of future
remote access network switches. Further, remote access will have to
accommodate an increasing number of physical mediums, such as ISDN, T1,
E1, OC-3 through OC-48, cable, and xDSL modems.
[0015] Industry consultants have defined a network processor (herein also
mentioned as an "NP") as a programmable communications integrated circuit
capable of performing one or more of the following functions:
[0016] Packet classification--identifying a packet based on known
characteristics, such as address or protocol
Packet modification--modifying the packet to comply with IP, ATM, or other
protocols (for example, updating the time-to-live field in the header for
IP)
[0017] Queue/policy management--reflects the design strategy for packet
queuing, de-queuing, and scheduling of packets for specific applications
Packet forwarding--transmission and receipt of data over the switch
fabric and forwarding or routing the packet to the appropriate address
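By way of illustration only, the four functions listed above can be sketched in Python as follows. The Packet record, the helper names classify, modify, forward and process, the FIFO queuing policy, and the forwarding table are all hypothetical and chosen for this sketch; they are not part of the consultants' definition. A real IP time-to-live update would also require recomputing the header checksum, which is omitted here.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Packet:
    protocol: str  # e.g. "IP" or "ATM" (hypothetical field)
    dst: str       # destination address
    ttl: int       # time-to-live, meaningful for IP only


def classify(pkt):
    """Packet classification: identify a packet by known characteristics."""
    return (pkt.protocol, pkt.dst)


def modify(pkt):
    """Packet modification: update the time-to-live field for IP."""
    if pkt.protocol == "IP":
        pkt.ttl -= 1
    return pkt


def forward(pkt, table):
    """Packet forwarding: return an output port, or None if the
    time-to-live has expired or no route is known."""
    if pkt.protocol == "IP" and pkt.ttl <= 0:
        return None
    return table.get(pkt.dst)


def process(packets, table):
    """Queue management: a simple first-in, first-out policy (assumed)."""
    queue = deque(packets)
    results = []
    while queue:
        pkt = modify(queue.popleft())
        results.append((classify(pkt), forward(pkt, table)))
    return results
```

In this sketch a packet whose time-to-live reaches zero, or whose destination is absent from the table, is given no output port.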
[0018] Although this definition is an accurate description of the basic
features of early NPs, the full potential capabilities and benefits of
NPs are yet to be realized. Network processors can increase bandwidth and
solve latency problems in a broad range of applications by allowing
networking tasks previously handled in software to be executed in
hardware. In addition, NPs can provide speed improvements through
architectures, such as parallel distributed processing and pipeline
processing designs. These capabilities can enable efficient search
engines, increase throughput, and provide rapid execution of complex
tasks.
[0019] Network processors are expected to become the fundamental network
building block for networks in the same fashion that CPUs are for PCs.
Typical capabilities offered by an NP are real-time processing, security,
store and forward, switch fabric, and IP packet handling and learning
capabilities. NPs target ISO layer two through five and are designed to
optimize network-specific tasks.
[0020] The processor-model NP incorporates multiple general purpose
processors and specialized logic. Suppliers are turning to this design to
provide scalable, flexible solutions that can accommodate change in a
timely and cost-effective fashion. A processor-model NP allows
distributed processing at lower levels of integration, providing higher
throughput, flexibility and control. Programmability can enable easy
migration to new protocols and technologies, without requiring new ASIC
designs. With processor-model NPs, NEVs benefit from reduced
non-refundable engineering costs and improved time-to-market.
BRIEF DESCRIPTION OF THE INVENTION
[0021] One purpose of this invention is to provide a scalable switch
architecture for use in a data communication network which is capable of
sizing support capabilities to a range of potential demands while
improving the speed of handling of data being transferred. This purpose
is pursued by providing components, and assemblages of components, which
remove from the workload of processing units involved a greater amount of
data handling than has been the case heretofore.
[0022] Another purpose is to provide an interface device or network
processor (the terms being used interchangeably) which includes a
plurality of sub-assemblies integrated on a single substrate and coacting
to provide media rate switching of frames that include layer 2, layer 3,
layer 4 and layer 5. The interface device may be used as a standalone
solution providing a first level of capability for a work group switch,
an interconnected solution providing a higher level of capability work
group switch or scaled further upward in capability by cooperation with a
switching fabric device.
BRIEF DESCRIPTION OF THE DRAWINGS
[0023] Some of the purposes of the invention having been stated, others
will appear as the description proceeds, when taken in connection with
the accompanying drawings, in which:
[0024] FIG. 1 shows a block diagram for an interface device in accordance
with this invention.
[0025] FIG. 1A shows a block diagram for the MAC.
[0026] FIGS. 2A through 2D show the interface device interconnected with
other components in different system configurations.
[0027] FIG. 3 shows the flow and processing of an encapsulated guided
frame.
[0028] FIG. 4 shows the flow and processing of an internal guided frame.
[0029] FIG. 5 shows generalized format for a Guided Cell.
[0030] FIG. 6 shows the format for Frame Control Information.
[0031] FIG. 7 shows the format for the Correlator.
[0032] FIG. 8 shows Command Control Information Format.
[0033] FIG. 9 shows Addressing Information Format.
[0034] FIG. 10 shows General Form of Structure Addressing.
[0035] FIG. 11 shows chart for Addressing, Island Encoding.
[0036] FIG. 12A shows a block diagram of the Embedded Processor Complex.
[0037] FIG. 12B shows a schematic of the Embedded Processors.
[0038] FIG. 12C shows a structure for a GxH Processor.
[0039] FIG. 13 shows a block diagram of the memory complex.
[0040] FIG. 14 shows a flowchart for the Fixed Match (FM) search
algorithm.
[0041] FIG. 15 shows flows illustrating Data Structure without using a
Direct Table and with using a Direct Table.
[0042] FIG. 16 shows a block diagram of a switching system such as
Prizma.
[0043] FIG. 17 shows a block diagram of a CP.
[0044] FIG. 18 shows a block diagram of the single chip Network Processor
highlighting function in the EDS-UP, EDS DOWN and the E-PC.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0045] While the present inventions will be described more fully
hereinafter with reference to the accompanying drawings, in which
preferred embodiments of the present inventions are shown, it is to be
understood at the outset of the description which follows that persons of
skill in the appropriate arts may modify the inventions here described
while still achieving the favorable results of the inventions.
Accordingly, the description which follows is to be understood as being a
broad, teaching disclosure directed to persons of skill in the
appropriate arts, and not as limiting upon the present inventions.
[0046] Apparatus disclosed here is scalable and capable of functioning to
interconnect desktop or workgroup switches, aggregate such switches into
a network backbone, and provide backbone switching services. The
apparatus can support Layer 2, Layer 3, and Layer 4+ forwarding in
hardware. Certain forms of the apparatus are designed for desktop or
workgroup switch aggregation while others are targeted as core
backbone switches.
[0047] The architecture used for the apparatus is based on an interface
device or network processor hardware subsystem and a software library
running on a control point, all as more fully described elsewhere in this
document. The interface device or network processor subsystem is a high
performance frame forwarding engine designed for parsing and translation
of L2, L3, and L4+ protocol headers. This allows protocols to be switched
at greater speeds using hardware. The interface device or network
processor subsystem provides a fast-path through the box while the
software library and control point processor provide management and route
discovery functions needed to maintain the fast-path. The control point
processor and the software library running thereon together define the
Control Point (CP) of the system. The CP is where the actual bridging and
routing protocols such as Transparent Bridging and OSPF are run. It can
also be referred to as the slow-path of the system.
[0048] While the apparatus here disclosed supports multi-layer forwarding
in hardware it can also operate as a L2 only switch and that is its
default mode of operation in the simplest form disclosed. Each port will
be put into a single domain allowing any device to communicate with any
other device. The apparatus is configurable at L2 allowing system
administrators the ability to configure features such as grouping ports
into separate domains or trunks, configuring Virtual LAN (VLAN) segments,
or filters to control broadcast and multicast traffic.
[0049] This scalable apparatus has many benefits. First, it allows the
system administrator the ability to configure L3 forwarding and routing
of IP and IPX traffic using the same hardware being used for L2 and at
the same speed. Second, it removes the need for using external routers to
interconnect campus buildings while increasing performance at the same
time. Third, it simplifies or combines the management of L2/L3 service
for a building into a single point of control. Finally, it provides value
added features with L4+ functions that allow system administrators the
ability to assign different traffic classifications to support mission
critical applications and network dispatcher for load-balancing among
servers.
[0050] The apparatus is designed to be a modular unit using an interface
device or network processor, a Control Point (CP), and an optional
switching fabric device as its fundamental building blocks. The interface
device preferably provides L2/L3/L4+ fast-path forwarding services while
the CP provides the management and route discovery functions needed to
maintain the fast-path. The optional switching fabric device is used when
more than two interface device subsystems are tied together. The optional
switching fabric device may be as disclosed in U.S. Pat. No. 5,008,878
issued Apr. 16, 1991 for High Speed Modular Switching Apparatus for
Circuit and Packet Switched Traffic mentioned hereinabove and
incorporated herein by reference.
[0051] The apparatus is anticipated to be assembled using printed circuit
board elements also here mentioned as "blades". The printed circuit board
elements have circuit elements mounted thereon and are received in
connectors provided in apparatus housings. Similar devices are also known
as "option cards". The apparatus contemplates that blades can be
exchanged among varying chassis or housings, provided that appropriate
connectors and backplane electrical connections are provided. The basic
component found on all blades is a carrier subsystem. Starting with the
carrier subsystem, three types of blades can be produced. The first type
is a CP only Blade, which consists of a carrier subsystem and a CP
subsystem. The primary use of a CP only blade is for a product where
redundancy is the primary concern. The second type is a CP+ Media Blade,
which consists of a carrier subsystem, a CP subsystem, and 1-to-3 media
subsystems. The primary use of a CP+ Media blade is a product where port
density is deemed more important than redundancy. The third type is a
Media Blade, which consists of a carrier subsystem and 1-to-4 media
subsystems. The media blades can be used in any chassis and the type of
media subsystem used is configurable.
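The three blade types just described can be summarized, purely as an illustrative sketch, by the following Python data structure. The type labels CP_ONLY, CP_MEDIA and MEDIA, the Blade class, and its method names are hypothetical; only the permitted counts of media subsystems are taken from the description above.

```python
from dataclasses import dataclass

# Permitted number of media subsystems per blade type, per the text above.
# Every blade type also carries one carrier subsystem.
MEDIA_LIMITS = {
    "CP_ONLY": (0, 0),   # carrier subsystem + CP subsystem
    "CP_MEDIA": (1, 3),  # carrier + CP + 1-to-3 media subsystems
    "MEDIA": (1, 4),     # carrier + 1-to-4 media subsystems
}


@dataclass(frozen=True)
class Blade:
    kind: str
    media_subsystems: int

    def has_cp(self):
        """True when the blade carries a CP subsystem."""
        return self.kind in ("CP_ONLY", "CP_MEDIA")

    def is_valid(self):
        """True when the media subsystem count fits the blade type."""
        if self.kind not in MEDIA_LIMITS:
            return False
        low, high = MEDIA_LIMITS[self.kind]
        return low <= self.media_subsystems <= high
```

For example, under these assumptions a CP_MEDIA blade with four media subsystems would be rejected, while a MEDIA blade with four would be accepted.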
[0052] Blade management will involve fault detection, power management,
new device detection, initialization, and configuration. This management
will be done using various registers, I/O signals, and a guided cell
interface that is used to communicate between the CP and carrier
subsystems. However, unlike the chassis, there do exist programmable
devices and memory on all blades. The amount of programmability depends
on the type of blade. When the CP subsystem exists on a blade both the CP
and carrier subsystems are programmable. The media subsystems are also
programmable but only indirectly through the carrier subsystem.
[0053] In higher capability products there also exists a Switch Blade
which contains the switching fabric device subsystem. The management of
this blade will involve fault detection, power management, new device
detection, and initialization. This management will be done using various
registers and I/O signals that will be mapped into the CP subsystem.
[0054] In its simplest form, a switch apparatus contemplated by this
invention has a control point processor; and an interface device
operatively connected to the control point processor. Preferably and as
here disclosed, the interface device (also known as a network processor)
is a unitary Very Large Scale Integrated (VLSI) circuit device or chip
which has a semiconductor substrate; a plurality of interface processors
formed on the substrate; internal instruction memory formed on said
substrate and storing instructions accessibly to the interface
processors; internal data memory formed on the substrate and storing data
passing through the device accessibly to the interface processors; and a
plurality of input/output ports. The interface processors are also
sometimes herein identified as picoprocessors or processing units. The
ports provided include at least one port connecting the internal data
memory with external data memory and at least two other ports exchanging
data passing through the interface device with an external network under
the direction of the interface processors. The control point cooperates
with the interface device by loading into the instruction memory
instructions to be executed by the interface processors in directing the
exchange of data between the data exchange input/output ports and the
flow of data through the data memory.
[0055] The network processor here disclosed is deemed inventive apart from
the switch assemblies into which it is incorporated. Further, the network
processor here disclosed is deemed to have within its elements here
described other and further inventions not here fully discussed.
[0056] FIG. 1 shows a block diagram for the interface device chip that
includes substrate 10 and a plurality of sub-assemblies integrated on the
substrate. The sub-assemblies are arranged into an Upside configuration
and a Downside configuration. As used herein, "Upside" refers to data
flows inbound from a network to the apparatus here disclosed, while
"Downside" refers to data outbound from the apparatus to a network
serviced by the apparatus. The data flow follows the respective
configurations. As a consequence, there is an Upside data flow and a
Downside data flow. The sub-assemblies in the Upside include
Enqueue-Dequeue-Scheduling UP (EDS-UP) logic 16, multiplexed MAC's-UP
(PPM-UP) 14, Switch Data Mover-UP (SDM-UP) 18, System Interface (SIF)
20, Data Align Serial Link A (DASLA) 22, and Data Align Serial Link B
(DASLB) 24. A data align serial link is more fully described in copending
U.S. patent application Ser. No. 09/330,968 filed Jun. 11, 1999 and
entitled "High Speed Parallel/Serial Link for Data Communication"
mentioned hereinabove and incorporated by reference hereinto. While the
preferred form of the apparatus of this invention here disclosed uses a
DASL link, the present invention contemplates that other forms of links
may be employed to achieve relatively high data flow rates, particularly
where the data flows are restricted to being within the VLSI structure.
[0057] The sub-assemblies in the downside include DASL-A 26, DASL-B 28,
SIF 30, SDM-DN 32, EDS-DN 34, and PPM-DN 36. The chip also includes a
plurality of internal S-RAM's, Traffic Mgt Scheduler 40, and Embedded
Processor Complex (EPC) 12. An interface device 38 is coupled by
respective DMU Busses to PMM 14 and 36. The interface 38 could be any
suitable L1 circuitry, such as ethernet Physical (ENET PHY), ATM Framer,
etc. The type of interface is dictated in part by the network media to
which the chip is connected. A plurality of external D-RAM's and S-RAM
are available for use by the chip.
[0058] While here particularly disclosed for networks in which the general
data flow outside the relevant switching and routing devices is passed
through electrical conductors such as wires and cables installed in
buildings, the present invention contemplates that the network switches
and components thereof here disclosed may be used in a wireless
environment as well. By way of an illustrative example, the media access
control (MAC) elements here described may be replaced by suitable radio
frequency elements, possibly using known Silicon Germanium technology,
which would result in a capability to link the elements here described
directly to a wireless network. Where such technology is appropriately
employed, the radio frequency elements can, by persons of appropriate
skill in the applicable arts, be integrated into the VLSI structures
here disclosed. Alternatively, radio frequency or otherwise wireless
response devices such as infrared responsive devices can be mounted on a
blade with other elements here disclosed to achieve a switch apparatus
useful with wireless network systems.
[0059] The arrows show the general flow of data within the Interface
device. Frames received from an Ethernet MAC are placed in internal Data
Store buffers by the EDS-UP. These frames are identified as either normal
Data Frames or system control Guided Frames and enqueued to the EPC (FIG.
1). The EPC contains N protocol processors capable of working on up to N
frames in parallel (N>1). In an embodiment of ten protocol processors
(FIG. 12B), two of the ten protocol processors are specialized; one for
handling Guided Frames (the Generic Central Handler or GCH) and one for
building Lookup Data in Control Memory (the Generic Tree Handler or GTH).
As shown in FIG. 12A, the EPC also contains a dispatcher which matches
new frames with idle processors, a completion unit which maintains frame
sequence, a Common Instruction memory shared by all ten processors, a
Classifier Hardware Assist which determines frame classification and
coprocessor which helps determine the starting instruction address of the
frame, Ingress and Egress Data Store interfaces which control read and
write operations of frame buffers, a Control Memory Arbiter which allows
the ten processors to share Control Memory, a Web Control, Arbiter and
interface that allows debug access to internal Interface device data
structures, as well as other hardware constructs.
[0060] Guided Frames are sent by the dispatcher to the GCH processor as it
becomes available. Operations encoded in the Guided Frame are executed,
such as register writes, counter reads, Ethernet MAC configuration
changes, and so on. Lookup table alterations, such as adding MAC or IP
entries, are passed on to the Lookup Data processor for Control Memory
operations, such as memory reads and writes. Some commands, such as MIB
counter reads, require a response frame to be built and forwarded to the
appropriate port on the appropriate Interface device. In some cases, the
Guided Frame is encoded for the Egress side of Interface device. These
frames are forwarded to the Egress side of the Interface device being
queried, which then executes the encoded operations and builds any
appropriate response frame.
[0061] Data frames are dispatched to the next available protocol processor
for performing frame lookups. Frame data are passed to the protocol
processor along with results from the Classifier Hardware Assist (CHA)
Engine. The CHA parses IP or IPX. The results determine the Tree Search
algorithm and starting Common Instruction Address (CIA). Tree Search
algorithms supported include Fixed Match Trees (fixed size patterns
requiring exact match, such as Layer 2 Ethernet MAC tables), Longest
Prefix Match Trees (variable length patterns requiring variable length
matches, such as subnet IP forwarding) and Software Managed Trees (two
patterns defining either a range or a bit mask set, such as used for
filter rules).
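The three search styles named above can be illustrated with a minimal Python sketch. The table contents, function names, and the "first matching rule wins" policy are hypothetical and chosen only for illustration; this passage does not describe the data structures actually used by the Tree Search Engine.

```python
import ipaddress

def fixed_match(table, key):
    """Fixed Match: the whole fixed-size key must match exactly."""
    return table.get(key)

def longest_prefix_match(routes, address):
    """Longest Prefix Match: choose the matching entry with the longest prefix."""
    addr = ipaddress.ip_address(address)
    best = None
    for prefix, result in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, result)
    return None if best is None else best[1]

def range_match(rules, value):
    """Range rule: two values define a range (assumed policy: first hit wins)."""
    for low, high, result in rules:
        if low <= value <= high:
            return result
    return None

# Hypothetical example tables.
mac_table = {"00:11:22:33:44:55": "port 3"}
routes = [("10.0.0.0/8", "blade 1"), ("10.1.0.0/16", "blade 2")]
filters = [(0, 1023, "deny"), (1024, 65535, "permit")]
```

In this sketch an address such as 10.1.2.3 matches both route entries, and the /16 entry is selected because its prefix is longer.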
[0062] Lookup is performed with the aid of the Tree Search Engine (TSE)
Coprocessor, which is a part of each protocol processor. The TSE
Coprocessor performs Control memory accesses, freeing the protocol
processor to continue execution. Control memory stores all tables,
counters, and other data needed by the picocode. Control memory
operations are managed by the Control memory Arbiter, which arbitrates
memory access among the ten processor complexes.
[0063] Frame data are accessed through the Data Store Coprocessor. The
Data Store Coprocessor contains a primary data buffer (holding up to
eight 16-byte segments of frame data), a scratch pad data buffer (also
holding up to eight 16-byte segments of frame data) and some control
registers for Data Store operations. Once a match is found, Ingress frame
alterations may include a VLAN header insertion or overlay. This
alteration is not performed by the interface device processor complex,
but rather hardware flags are derived and other Ingress Switch Interface
hardware performs the alterations. Other frame alterations can be
accomplished by the picocode and the Data Store Coprocessor by modifying
the frame contents held in the Ingress Data Store.
[0064] Other data are gathered and used to build Switch Headers and Frame
Headers prior to sending frames to the switch fabric device. Control data
include switch information, such as the destination blade of the frame,
as well as information for the Egress Interface device, helping it
expedite frame lookup of destination ports, multicast or unicast
operations, and Egress Frame alterations.
[0065] Upon completion, the Enqueue Coprocessor builds the necessary
formats for enqueuing the frame to the switch fabric and sends them to
the Completion Unit. The Completion Unit guarantees frame order from the
ten protocol processors to the switch fabric queues. Frames from the
switch fabric queues are segmented into 64 byte cells with Frame Header
bytes and Switch Header bytes inserted as they are transmitted to the
Prizma-E Switch.
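The ordering guarantee attributed to the Completion Unit can be pictured with the following minimal Python sketch. The mechanism shown (a sequence number assigned at dispatch and a holding area for frames that finish early) is an assumption made for illustration; the passage states only that frame order is guaranteed, not how.

```python
class CompletionUnit:
    """Releases finished frames strictly in dispatch order (illustrative only)."""

    def __init__(self):
        self._next = 0    # sequence number of the next frame allowed to leave
        self._done = {}   # finished frames still waiting for earlier frames

    def complete(self, seq, frame):
        """Record a finished frame and return the frames now releasable, in order."""
        self._done[seq] = frame
        released = []
        while self._next in self._done:
            released.append(self._done.pop(self._next))
            self._next += 1
        return released
```

If the processor handling frame 1 finishes before the processor handling frame 0, frame 1 is held until frame 0 completes, and both are then released together in their original order.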
[0066] Frames received from the switch fabric are placed in Egress Data
Store (Egress DS) buffers by an Egress EDS (34) and enqueued to the EPC.
A portion of the frame is sent by the dispatcher to an idle protocol
processor for performing frame lookups. Frame data are dispatched to the
protocol processor along with data from the Classifier Hardware Assist.
The Classifier Hardware Assist uses frame control data created by the
Ingress Interface device to help determine the beginning Code Instruction
Address (CIA).
[0067] Egress Tree Searches support the same algorithms as supported for
Ingress Searches. Lookup is performed with the TSE Coprocessor, freeing
the protocol processor to continue execution. All Control memory
operations are managed by the Control memory Arbiter, which allocates
memory access among the ten processor complexes.
[0068] Egress frame data are accessed through the Data Store Coprocessor.
The Data Store Coprocessor contains a primary data buffer (holding up to
eight 16-byte segments of frame data), a scratch pad data buffer (also
holding up to eight 16-byte segments of frame data) and some control
registers for Data Store operations. The result of a successful lookup
contains forwarding information and, in some cases, frame alteration
information. Frame alterations can include VLAN header deletion, Time to
Live increment (IPX) or decrement (IP), IP Header Checksum recalculation,
Ethernet frame CRC overlay or insertion and MAC DA/SA overlay or
insertion. IP Header checksums are prepared by the Checksum Coprocessor.
Alterations are not performed by the Interface device Processor Complex,
but rather hardware flags are created and PMM Egress hardware performs
the alterations. Upon completion, the Enqueue Coprocessor is used to help
build the necessary formats for enqueuing the frame in the EDS Egress
queues and sending them to the Completion Unit. The Completion Unit
guarantees frame order from the ten protocol processors to the EDS Egress
queues feeding the egress Ethernet MACs 36.
[0069] The completed frames are finally sent by PMM Egress hardware to the
Ethernet MACs and out the Ethernet ports.
[0070] An internal bus, referred to as the Web, allows access to internal
registers, counters and memory. The Web also includes an external
interface to control instruction step and interrupt control for debugging
and diagnostics.
[0071] Tree Search Engine coprocessor provides memory range checking,
illegal memory access notification and performs tree search instructions
(such as memory read, write or read-add-write) operating in parallel with
protocol processor execution.
[0072] Common Instruction Memory consists of one 1024.times.128 RAM and
two sets of Dual 512.times.128 RAM. Each set of Dual RAMs provides two
copies of the same picocode, allowing processors independent access to
instructions within the same address range. Each 128-bit word includes
four 32-bit instructions, providing a total range of 8192 instructions.
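The stated range of 8192 instructions follows from the RAM sizes if the duplicated copy in each Dual set is counted once, as the short calculation below shows. The variable names are illustrative only.

```python
WORD_BITS = 128
INSTRUCTION_BITS = 32

# One 1024-word RAM plus two sets of 512 words. Each Dual set holds two
# copies of the same picocode, so the second copy adds access bandwidth
# but no additional address range.
addressable_words = 1024 + 2 * 512

instructions_per_word = WORD_BITS // INSTRUCTION_BITS
total_instructions = addressable_words * instructions_per_word
```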
[0073] The Dispatcher controls the passing of frames to the ten protocol
processors and manages interrupts and timers.
[0074] The Completion Unit guarantees frame order from the processor
complex to the switch fabric and target port queues. A rich instruction
set includes conditional execution, packing (for input hash keys),
conditional branching, signed and unsigned operations, counts of leading
zeros and more.
[0075] The Classifier Hardware Assist engine parses each frame's layer 2
and layer 3 protocol header and provides this information with frames as
they are dispatched to the protocol processors.
[0076] The Control memory Arbiter controls processor access to both
internal and external memory.
[0077] External Control memory options include 5 to 7 DDR DRAM subsystems
each supporting a pair of 2M.times.16 bit.times.4 bank or a pair of
4M.times.16 bit.times.4 bank DDR DRAMs. The DDR DRAM interface runs at a
133 MHz clock rate and a 266 MHz data strobe supporting configurable CAS
latency and drive strength. An optional 133 MHz ZBT SRAM can be added in
either a 128K.times.36, 2.times.256K.times.18 or 2.times.512K.times.18
configuration.
[0078] Egress frames may be stored in either one External Data Buffer
(e.g. DS0) or two External Data Buffers (DS0 and DS1). Each Buffer can be
comprised of a pair of 2M.times.16 bit.times.4 bank DDR DRAM (storing up
to 256K 64-byte frames) or a pair of 4M.times.16 bit.times.4 bank DDR
DRAM (storing up to 512K 64-byte frames). Choose the single External Data
Buffer (e.g. DS0) for 2.28 Mbps or add the second Buffer (e.g. DS1) to
support 4.57 Mbps layer 2 and layer 3 switching. Adding the second Buffer
improves performance, but it does not increase frame capacity. The
External Data Buffer interface runs at a 133 MHz clock rate with a 266
MHz data strobe and supports configurable CAS latency and drive strength.
[0079] Internal Control memory includes two 512.times.128 bit RAMs, two
1024.times.36 bit RAMs and one 1024.times.64 bit RAM.
[0080] Internal Data storage provides buffering for up to 2048 64-byte
frames in the Ingress direction (UP).
[0081] Fixed Frame alterations include VLAN tag insertions in the Ingress
direction and VLAN tag deletions, Time To Live increment/decrement (IP,
IPX), Ethernet CRC overlay/insert and MAC DA/SA overlay/insert in the
Egress (DOWN) direction.
[0082] Port mirroring allows one receive port and one transmit port to be
copied to a system designated observation port without using protocol
processor resources. Mirrored Interface device ports are configured to
add frame and switch control data. A separate data path allows direct
frame enqueuing to the Ingress Switch interface.
[0083] The interface device integrates four Ethernet macros. Each macro
can be individually configured to operate in either 1 Gigabit or 10/100
Fast Ethernet modes. Each Ethernet macro supports the following:
[0084] Up to ten 10/100 Mbps MACs or one 1000 Mbps MAC for each of four
macros.
[0085] FIG. 1A shows a block diagram of the MAC core. Each macro includes
three Ethernet Core designs; to wit, the multiport 10/100 Mbps MAC Core
(Fenet), the 1000 Mbps MAC core (Genet) and the 1000 Mbps Physical Coding
Sublayer Core (PCS).
[0086] Multi-Port Ethernet 10/100 MAC Features:
[0087] Supports ten Serial Medium Independent Interfaces to the physical
layer
[0088] Capable of handling ten ports of 10 Mbps or 100 Mbps media speeds,
any speed mix
[0089] A single MAC services all ten ports with a Time Division Multiplex
interface
[0090] Supports Full/Half duplex operations at media speed on all ports
[0091] Supports IEEE 802.3 Binary Exponential Backoff
1000 Mbps Ethernet MAC Core Features:
[0092] Supports Gigabit Medium Independent Interface (GMII) to the
physical PCS layer or directly to the physical layer
[0093] With the PCS Core, supports a complete TBI (8b/10b) solution
[0094] Supports Full duplex Point-to-Point connections at media speed
[0095] Supports the IBM PCS Core valid byte signalling
[0096] 1000 Mbps Ethernet Physical Coding Sublayer Core Features:
[0097] Performs 8b/10b encoding and decoding
[0098] Supports the PMA (10 bit) Service Interface as defined in IEEE
802.3z, this interface attaches to any PMA that is compliant with IEEE
802.3z
[0099] Synchronizes data received from the PMA (two phase clock) with the
MAC (single phase) clock
[0100] Supports Auto-Negotiation including two next pages
[0101] Converts from a two phase clock system defined in the standards to
a single phase clock
[0102] Provides a signal to the MAC indicating those clock cycles that
contain new data
[0103] Checks the received code groups (10 bits) for COMMA's and
establishes word sync
[0104] Calculates and checks the 8b/10b running disparity
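The running disparity check named in the last item can be sketched in simplified form as follows. This Python sketch works on whole 10-bit code groups and tests only a necessary condition: each group must have a disparity of 0 or plus or minus 2, and a non-zero disparity must be opposite in sign to the current running disparity. A complete 8b/10b checker also validates the 6-bit and 4-bit sub-blocks and the code group tables, which are omitted here. The function name and the starting value of negative running disparity are assumptions made for illustration.

```python
def check_running_disparity(code_groups, rd=-1):
    """Simplified whole-code-group check; returns final RD or raises ValueError."""
    for cg in code_groups:
        ones = bin(cg & 0x3FF).count("1")
        disparity = ones - (10 - ones)   # ones minus zeros in the 10-bit group
        if disparity == 0:
            continue                     # neutral group leaves RD unchanged
        if disparity not in (2, -2) or (disparity > 0) == (rd > 0):
            raise ValueError(f"disparity error in code group {cg:010b}")
        rd = -rd                         # a non-zero group flips the running disparity
    return rd
```

For example, the two encodings of the K28.5 comma character, 0011111010 followed by 1100000101, alternate the running disparity and pass the check, whereas sending the first encoding twice in a row is reported as an error.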
[0105] FIGS. 2A-2D show different configurations for the Interface device
Chip. The configurations are facilitated by DASL and connection to a
switching fabric device. Each DASL includes two channels; namely, a
transmit channel and a receiver channel.
[0106] FIG. 2A shows a wrap configuration for a single Interface device.
In this configuration, the transmit channel is wrapped to the receive
channel.
[0107] FIG. 2B shows the configuration in which two Interface device Chips
are connected. Each Interface device Chip is provided with at least two
DASLs. In this configuration, the channels on one DASL on one chip are
operatively connected to the channels of a matching DASL on the other
chip. The other DASL on the chip is wrapped.
[0108] FIG. 2C shows the configuration in which multiple Interface devices
are connected to a switch fabric. The double headed arrows indicate
transmission in both directions.
[0109] FIG. 2D shows the configuration in which a Main switch and a Backup
switch are connected to Multiple Interface devices. If the main switch
goes down, the backup is available for use.
[0110] A Control Point (CP) includes a System Processor that is connected
to each of the configurations. The system processor at the CP, among other
things, provides initialization and configuration services to the chip.
The CP may be located in any of three locations: in the interface device
chip; on the blade on which the chip is mounted or external to the blade.
If external to the blade, the CP may be remote; that is, housed elsewhere
and communicating by the network to which the interface device and CP are
attached. The elements of a CP are shown in FIG. 17 and include memory
elements (cache, flash and SDRAM), a memory controller, a PCI bus, and
connectors for a backplane and for L1 network media.
[0111] FIG. 18 shows the single chip Network Processor and the functions
provided by the EDS-UP, the traffic Management (MGT) Scheduler and the
EDS-DOWN (DN). The U-shaped icon separates queues and the Control Blocker
(CB) that keeps track of the contents in the queues are represented by
rectangular icons.
[0112] A description of the elements, their respective functions and
interaction follows.
[0113] PMM: This is the part of the Network Processors that contains the
MACs (FEnet, POS, GEnet) and attaches to the external PHY devices.
[0114] UP-PMM: This logic takes bytes from the PHYs and formats them into
FISH (16 bytes) to pass on to the UP-EDS. There are 4 DMUs within the
PMM, each capable of working with 1 GEnet or 10 FEnet devices.
[0115] UP-EDS: This logic takes the FISH from UP-PMM and stores them into
the UP-Data Store (internal RAM). It is capable of working on 40 frames
at once, and after the appropriate number of bytes are received, it will
enqueue the frame to the EPC. When the EPC is finished with the frame,
the UP-EDS will enqueue the frame into the appropriate Target Port Queue
and start sending the frame to the UP-SDM. The UP-EDS is responsible for
all buffer and frame management and returns the buffers/frames back to
free pools when the transfer to UP-SDM is complete.
[0116] EPC: This logic contains the picoprocessors and (could) contain the
embedded PowerPC. This logic is capable of looking at the frame header
and deciding what to do with the frame (forward, modify, filter, etc.).
The EPC has access to several lookup tables, and hardware assists to
allow the picoprocessors to keep up with the high-bandwidth requirements
of the Network Processor.
[0117] UP-SDM: This logic takes the frames, and formats them into PRIZMA
cells for transmission to the switch fabric. This logic is also capable
of inserting the VLAN header into the frame.
[0118] UP-SIF: This logic contains the UP-DASL macros and attaches to the
external switch I/Os.
[0119] DN-SIF: This logic contains the DN-DASL macros and receives PRIZMA
cells from the external I/Os.
[0120] DN-SDM: This logic receives the PRIZMA cells and preprocesses them
for help in frame reassembly.
[0121] DN-EDS: This logic takes each cell and assembles them back into
frames. The cell is stored into external Data Store, and buffers are
linked together to make frames. When the entire frame is received, the
frame will be enqueued to the EPC. After EPC is finished with the frame,
it is enqueued to the Scheduler (if present) or the Target Port Queues.
DN-EDS then sends the frames to the appropriate port by sending the
frame, any alteration information, and some control information to the
DN-PMM.
[0122] DN-PMM: Takes the information from DN-EDS and formats the frame
into Ethernet, POS, etc. and sends the frame to the external PHY.
[0123] SPM: This logic is used to allow the Network Processor to interface
to external devices (PHYs, LEDs, FLASH, etc) but only requires 3 I/Os.
The Network Processor uses a serial interface to communicate to SPM and
then SPM performs the necessary functions to manage these external
devices.
[0124] UP-SIDE Flow
[0125] 1) Frame arrives at PHY
[0126] 2) Bytes are received by UP-PMM
[0127] 3) UP-PMM sends FISH over to UP-EDS (Fish means a portion of a
frame)
[0128] 4) UP-EDS stores FISH into UP-DS
[0129] 5) UP-EDS sends header over to EPC
[0130] 6) EPC processes header and sends enqueue information back to
UP-EDS
[0131] 7) UP-EDS continues to receive the remainder of frame from UP-PMM
[0132] 8) UP-EDS sends information to UP-SDM when appropriate data is
ready to send to switch
[0133] 9) UP-SDM reads frame data and formats it into PRIZMA cells
[0134] 10) UP-SDM sends cells to UP-SIF
[0135] 11) UP-SIF transfers the cells over the DASL serial links to PRIZMA
[0136] 12) UP-EDS frees up buffers/frames when all the data has been taken
DN-SIDE Flow
[0137] 1) DN-SIF receives PRIZMA cells
[0138] 2) DN-SDM stores cells and preprocesses them for reassembly
information
[0139] 3) DN-EDS receives the cell data and reassembly information and
links the cell into a new frame on down side
[0140] 4) DN-EDS stores the cell into DN-DS
[0141] 5) DN-EDS enqueues the frame to EPC when all of the data have been
received
[0142] 6) EPC processes the header and sends enqueue information back to
DN-EDS
[0143] 7) DN-EDS enqueues the frame into a scheduler queue (if present) or
a Target Port Queue
[0144] 8) DN-EDS services the queues and sends frame information into the
PCB
[0145] 9) DN-EDS uses the PCB to "unravel" the frame and reads the
appropriate data and sends that data to DN-PMM
[0146] 10) DN-PMM formats the data (with alteration if requested) and
sends the frame to the external PHY
[0147] 11) DN-PMM informs DN-EDS when buffers are no longer needed and
DN-EDS frees these resources
[0148] FRAME Control Flow
[0149] 1) Header is sent to EPC from UP-DS or DN-DS
[0150] 2) EPC looks up header information in lookup tables and receives
frame enqueue information
[0151] 3) EPC sends the enqueue information back to the EDS and the frame
is enqueued to the appropriate queue
[0152] 4) Cell Headers and Frame Headers are sent along with the frame
data to aid in reassembly and frame forwarding
[0153] CP Control Flow
[0154] 1) Control Point formats a Guided Frame and sends it to the Network
Processor
[0155] 2) The Network Processor enqueues the Guided Frame to the GCH
picoprocessor
[0156] 3) The GCH processes the Guided Frame and reads or writes the
requested areas of Rainier
[0157] 4) The GCH passes any Table update requests over to the GTH
[0158] 5) The GTH updates the appropriate table with information from
Guided Frame
[0159] 6) An acknowledgement Guided Frame is sent back to CP
[0160] Network Processor Control Flow
[0161] 1) A Picoprocessor can build a Guided Frame to send information to
another Rainier or the Control Point
[0162] 2) The Guided Frame is sent to the appropriate location for
processing
[0163] A single Interface device provides media speed switching for up to
40 Fast Ethernet Ports (FIG. 2A). 80 Fast Ethernet Ports are supported
when two Interface devices are interconnected using IBM's Data Aligned
Synchronous Link (DASL) technology (FIG. 2B). Each DASL differential pair
carries 440 Mbps of data. Two sets of eight pairs provide a 3.5 Gbps
duplex connection (8 times 440 Mbps in each direction). As shown in FIGS.
2C and 2D, larger systems can be built by interconnecting multiple Interface
devices to a switch such as IBM's Prizma-E switch. The Interface device
provides two of the 3.5 Gbps duplex DASL connections, one primary and one
secondary, which can be used to provide a wrap-back path for local frame
traffic (when two Interface devices are directly connected, FIG. 2B) or a
connection to a redundant switch fabric (FIG. 2D, Backup Sw.). In view of
the above, the single Network Processor Chip is scaleable in that one
chip can be used to provide a low end system (having relatively low port
density--say 40) to high end system (having relatively high port density,
say 80-n ports).
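The bandwidth and port figures quoted above can be checked with a short calculation. The variable names are illustrative only; the quoted 3.5 Gbps is the rounded value of 8 times 440 Mbps.

```python
PAIR_RATE_MBPS = 440          # data rate of one DASL differential pair
PAIRS_PER_DIRECTION = 8       # one set of eight pairs per direction
PORTS_PER_DEVICE = 40         # Fast Ethernet ports on a single Interface device

per_direction_mbps = PAIR_RATE_MBPS * PAIRS_PER_DIRECTION
per_direction_gbps = per_direction_mbps / 1000

# Two directly connected Interface devices (FIG. 2B).
ports_two_devices = 2 * PORTS_PER_DEVICE
```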
[0164] One Interface device in the system is connected to the system
processor via up to ten 10/100 Mbps Fast Ethernet ports or a single 1000
Mbps Ethernet port. The Ethernet configuration to the system processor is
placed in an EEPROM attached to the Interface device and loaded during
initialization. The system processor communicates with all Interface
devices in a system (see FIG. 2) by building special Guided Frames
encapsulated as ethernet frames. The encapsulated Guided Frames are
forwarded across the DASL link to other devices allowing all of the
Interface devices in the system to be controlled from a single point.
[0165] Guided Frames are used to communicate control information between
the Control Point (CP) and the Embedded Processor Complex and within the
interface device. A prior disclosure of Guided Cells which will elucidate
the discussion here is found in U.S. Pat. No. 5,724,348 issued Mar. 3,
1998 for "Efficient Hardware/Software Interface for a Data Switch"
mentioned hereinabove and incorporated hereinto by reference.
[0166] For Guided Frame traffic that originates at the CP, the CP
constructs the Guided Frame in data buffers in its local memory. The CP's
Device Driver sends the Guided Frame to one of the media interfaces of
the Network Processor. Media Access Control (MAC) hardware recovers the
Guided Frame and stores it in its internal data store (U_DS) memory. The
Guided Frame is routed to the appropriate blade, processed, and routed
back to the CP as required. Guided Frames passing between an external CP
and the interface device are encapsulated to adapt to the protocol of the
external network. As a consequence, if the external network includes
ethernet, the Guided Frames are encapsulated as ethernet frames and so
forth.
[0167] Ethernet encapsulation provides a means of transport for Guided
Traffic between the CP and the Interface device. The Ethernet MAC (Enet
MAC) of the Interface device does not analyze the Destination Address
(DA) or Source Address (SA) when receiving frames. This analysis is
performed by the EPC picocode. Guided Traffic presumes that the Interface
device has not been configured and the DA and SA cannot be analysed by
the EPC picocode. Therefore, these frames are inherently self-routing.
The Enet MAC does, however, analyse the Ethernet Type field to
distinguish Guided Traffic from Data Traffic. The value of this Ethernet
Type value of the Guided Frame must match the value loaded into the
E_Type_C Register. This register is loaded from Flash Memory by the
Interface device's boot picocode.
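The Ethernet Type comparison described above can be sketched in Python as follows. The register value 0x88B5 is a placeholder chosen for illustration, not a value given in this disclosure, and the sketch assumes an untagged Ethernet frame whose Type field occupies bytes 12 and 13. The function name is hypothetical.

```python
import struct

E_TYPE_C = 0x88B5  # placeholder for the contents of the E_Type_C Register

def is_guided_frame(frame: bytes, e_type_c: int = E_TYPE_C) -> bool:
    """Compare the Ethernet Type field with the register; DA and SA are ignored."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    (ether_type,) = struct.unpack("!H", frame[12:14])  # network byte order
    return ether_type == e_type_c
```

Because only the Type field is inspected, the classification works before any address tables have been configured, which is the property the paragraph relies on.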
[0168] The CP constructs the Guided Frame in data buffers in its local
memory. The contents of a 32 bit register in the CP processor are stored
in big endian format in the local memory as shown in FIG. 3. Having
constructed the Guided Frame, the CP's Device Driver sends an Ethernet
frame containing a DA for specific Guided Cell Handler (GCH), an SA
corresponding to the global MAC address for the CP or the MAC address for
specific interface, a special Ethernet Type field that indicates a Guided
Frame, and the Guided Frame Data. All Ethernet frames arriving on the
port are received and analyzed by Enet MAC. For frames with an Ethernet
Type value matching the contents of the E_Type_C Register, the Enet MAC
strips off the DA, SA and Ethernet Type fields and stores the Guided
Frame data into the U_DS memory. Bytes are collected by the Enet MAC one
at a time into a block of 16 bytes called a Fish. These bytes are stored
in big endian format with the first byte of the Guided Frame stored in
the most significant byte location of the Fish (Byte 0). Succeeding bytes
are stored in successive byte locations within the Fish (Byte 1, Byte 2,
. . . , Byte 15). These 16 bytes are then stored in a Buffer in the U_DS
beginning at the Fish 0 location. Succeeding Fishes are stored in
successive Fish locations within the Buffer (Fish 1, Fish 2, Fish 3,
etc.). Additional Buffers are obtained from a free pool as required to
store the remainder of the Guided Frame.
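The storage layout just described can be sketched in Python as follows. The 16-byte Fish size and the stripping of DA, SA and Ethernet Type come from the text; the figure of four Fishes per Buffer is an assumption made for illustration (the text lists Fish 0 through Fish 3 followed by "etc."), and the function name is hypothetical.

```python
FISH_BYTES = 16
FISH_PER_BUFFER = 4    # assumption: a Buffer holds four Fishes (64 bytes)
ETH_HEADER_BYTES = 14  # DA (6) + SA (6) + Ethernet Type (2)

def store_guided_frame(ethernet_frame: bytes):
    """Strip the Ethernet header and split the rest into Buffers of Fishes."""
    data = ethernet_frame[ETH_HEADER_BYTES:]
    # Byte 0 of each Fish is the earliest byte received (big endian order).
    fishes = [data[i:i + FISH_BYTES] for i in range(0, len(data), FISH_BYTES)]
    # Additional Buffers are taken as needed to hold the remaining Fishes.
    return [fishes[i:i + FISH_PER_BUFFER]
            for i in range(0, len(fishes), FISH_PER_BUFFER)]
```

Under these assumptions, 100 bytes of Guided Frame data occupy seven Fishes: four in the first Buffer and three in the second, the last of which is only partly filled.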
[0169] The flow of guided traffic within the interface device 10 is shown
in FIG. 4. The Enet MAC function of the Interface device examines the
frame header information and determines that the frame is a Guided Frame.
The Enet MAC removes the frame header from the Guided Frame and buffers
the remainder of its contents in Interface device's internal U_DS memory.
The Enet MAC indicates that the frame is to be enqueued to the General
Control (GC) Queue for processing by the GCH. When the end of the Guided
Frame has been reached, the Enqueue, Dequeue, and Schedule (EDS) logic
enqueues the frame into the GC Queue.
[0170] The GCH picocode on the blade locally attached to the CP examines
the Frame Control Information (see FIG. 6) to determine whether the
Guided Frame is intended for other blades in the system and whether the
Guided Frame is to be executed on the down side of the Interface device.
If the frame is intended for blades other than or in addition to the
locally attached blade, the GCH picocode updates the TB value in the
Frame Control Block (FCB) with the TB value from the Guided Frame's Frame
Control information and instructs the EDS to enqueue the frame in the
multicast Target Blade Start of Frame (TB_SOF) Queue. For performance
reasons, all Guided Traffic is enqueued to the multicast TB_SOF queue
independent of the number of destination blades indicated.
[0171] If the frame is intended for only the locally attached blade, the
GCH picocode examines the up/down field of the Frame Control information
to determine whether the Guided Frame is to be executed on the up or down
side of the Interface device (see FIG. 6). If the Guided Frame is to be
executed on the down side of the Interface device, the GCH picocode
updates the TB value in the FCB with the TB value from the Guided Frame's
Frame Control information and instructs the EDS to enqueue the frame in
the multicast Target Blade Start of Frame (TB_SOF) Queue. If the Frame
Control information indicates that the Guided Frame is to be executed on
the up side, the GCH picocode analyzes the Guided Frame and performs the
operations indicated by the Guided Commands it contains.
[0172] Prior to processing of Guided Commands, the picocode checks the
value of the ack/noack field of the Frame Control information. If this
value is `0'b, then the Guided Frame is discarded following processing.
Guided read commands shall not be of this category.
[0173] If the value of the ack/noack field is `1'b, and the value of the
early/late field is `1'b, then prior to processing any of the Guided
Commands in the Guided Frame, the picocode constructs an Early Ack Guided
Frame with the value of the TB field of the Frame Control equal to the
contents of the My_TB Register. The
picocode routes the Early Ack Guided Frame back to the CP by updating the
TB value in the frame's FCB with the value contained in the TB field of
the LAN Control Point Address (LAN_CP_Addr) Register and instructing the
EDS to enqueue the frame in the multicast TB_SOF Queue. The picocode then
processes the Guided Commands of the Guided Frame and discards the Guided
Frame. Guided read commands shall not be of this category.
[0174] If, on the other hand, the value of the ack/noack field is `1'b,
and the value of the early/late field is `0'b, the picocode changes the
resp/req field of the Frame Control information to `1'b to indicate a
Guided Frame response, replaces the TB field with the contents of the
My_TB Register, and processes each Guided Command within the Guided
Frame. During the course of processing a Guided Command, the picocode
updates the Completion Code field of the next Guided Command with the
completion status code value for the current Guided Command. The picocode
routes the response back to the source by updating the TB value in the
(FCB) with the value corresponding to the Source Blade (LAN_CP_Addr
Register value for CP) and instructing the EDS to enqueue the frame in
the multicast TB_SOF Queue.
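The three acknowledgement cases for a Guided Frame executed on the up side can be summarized in a minimal Python sketch. The function name and the action strings are hypothetical labels for the steps described in the text; the sketch models only the order of the steps, not the picocode itself.

```python
def upside_ack_plan(ack_noack: int, early_late: int):
    """Return the ordered actions for a Guided Frame executed on the up side."""
    if ack_noack == 0:
        # No acknowledgement requested: process, then discard.
        return ["process commands", "discard frame"]
    if early_late == 1:
        # Early Ack: acknowledge before any command is processed.
        return ["build Early Ack with TB = My_TB",
                "enqueue Early Ack in TB_SOF Queue",
                "process commands",
                "discard frame"]
    # Late Ack: the processed frame itself becomes the response.
    return ["set resp/req = 1",
            "set TB = My_TB",
            "process commands and record completion codes",
            "enqueue response in TB_SOF Queue"]
```

The sketch makes the restriction on Guided read commands visible: only the late acknowledgement case returns the processed frame, so it is the only case that can carry read results back to the source.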
[0175] Frames residing in the TB_SOF Queue are scheduled for forwarding
by the EDS. The Switch Data Mover (SDM) builds the switching fabric Cell
Header and Interface device Frame Header from the information contained
in the FCB. These cells pass through the switching fabric device and
arrive at the target blade where the cells are reassembled into a frame
in the D_DS memory. The SDM of the down side recognizes that the frame is
a Guided Frame and signals the EDS to enqueue it in the GC Queue.
[0176] Pressure from the GC Queue or the GT Queue stimulates the picocode
to access and analyse the Guided Frames. All Guided Frames arriving on
the down side are initially enqueued in the GC Queue. The gch/gth value
of the Frame Control Information for these frames is examined by GCH
picocode. If the gch/gth value is `0'b, the Guided Frame is enqueued in
the GT Queue. Otherwise, the GCH picocode examines the resp/req field of
the Frame Control information to determine if the Guided Frame has
already been executed. If the resp/req has a value of `1'b, then the
Guided Frame has already been executed and is routed to the CP. Target
port values corresponding to CP connections are maintained by EPC
picocode. Frames from these Target Port queues are transmitted from the
Interface device back to the CP.
[0177] If the resp/req field has a value of `0'b, then the blade may be
local or remote with respect to the CP. This is resolved by comparing the
value of the TB field of the LAN_CP_Addr Register with the contents of
the My Target Blade (My_TB) Register. If they match, then the blade is
local to the CP; otherwise, the blade is remote from the CP. In either
case, the picocode examines the up/down value of the Frame Control
Information. If up/down is equal to `1'b, then the frame is enqueued in
the Wrap TP queue for forwarding to the U_DS and processing by the GCH on
the up side. Otherwise, the picocode (GCH or GTH) performs the operations
indicated by the Guided Commands contained in the Guided Frame. Prior to
processing of the Guided Commands, the picocode checks the value of the
ack/noack field of the Frame Control information. If this value is `0'b,
then the Guided Frame is discarded following processing. Guided read
commands shall not be of this category.
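The order in which the Frame Control fields are examined on the down side can be sketched in Python as follows. The function name and the returned strings are hypothetical labels; the sketch covers only the routing decision and leaves out the acknowledgement handling.

```python
def downside_route(gch_gth: int, resp_req: int, up_down: int) -> str:
    """Decide where a Guided Frame taken from the down-side GC Queue goes next."""
    if gch_gth == 0:
        return "GT Queue"              # handed to the GTH
    if resp_req == 1:
        return "CP port queue"         # already executed; return it to the CP
    if up_down == 1:
        return "Wrap TP queue"         # to be executed by the GCH on the up side
    return "execute on down side"      # commands are performed here
```

Each field is consulted only if the preceding checks did not already settle the destination, so a response frame is never re-executed regardless of its up/down value.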
[0178] If the value of the ack/noack field is `1'b and the value of the
early/late field is `1'b, then prior to processing any of the guided
commands in the guided frame, the picocode constructs an Early Ack Guided
Frame with the value of the TB field of the Frame Control information
equal to the contents of the My_TB Register. If the blade is remote from
the CP, the picocode routes the Early Ack Guided Frame to the Wrap Port.
Otherwise, the blade is local to the CP and the frame is routed to the
Port Queue corresponding to the CP. The picocode processes the Guided
Commands while either the Wrap Port moves the Early Ack Guided Frame from
the D_DS to the U_DS and enqueues the frame in the GC Queue on the up
side or the frame is transmitted from the Port Queue back to the CP. For
frames wrapped back to the U_DS, the GCH picocode again sees this frame,
but the resp/req field will have a value of `1'b. The GCH picocode routes
the frame back to the CP by updating the TB field in the FCB with the
value contained in the TB field of the LAN_CP_Addr Register and
instructing the EDS to enqueue the frame in the multicast TB_SOF Queue.
Frames residing in the TB_SOF Queue are scheduled for forwarding by the
EDS. The SDM builds the Prizma Cell Header and Interface device Frame
header from information contained in the FCB. Cells from this frame pass
through Prizma and are reassembled into a frame on the CP's local blade.
The SDM of the down side recognizes that the frame is a Guided Frame and
signals the EDS to enqueue it in the GC Queue.
[0179] This time when the GCH picocode analyzes the frame, the resp/req
field has a value of `1'b. This implies that this blade is locally
attached to the CP and the Guided Frame is routed to the Port Queue
corresponding to the CP. Frames from this queue are transmitted from
Interface device back to the CP.
[0180] If, on the other hand, the value of the ack/noack field is `1'b and
the value of the early/late field is `0'b, the picocode changes the
resp/req field to `1'b to indicate a Guided Frame response, replaces the
TB field with the contents of the My_TB Register, and then processes each
Guided Command within the Guided Frame. During the course of processing a
Guided Command, the picocode updates the Completion Code field of the
next Guided Command with the completion status code value for the current
Guided Command. If the blade is remote from the CP, then the picocode
routes the Guided Frame to the Wrap Port. Otherwise, the blade is local
to the CP and the frame is routed to the Port Queue corresponding to the
CP. Either the Wrap Port moves the Guided Frame from the D_DS to the U_DS
and enqueues the frame in the GC Queue on the up side or the frame is
transmitted from the Port Queue back to the CP. For frames wrapped back
to the U_DS, the GCH picocode again sees this frame, but the resp/req
field will have a value of `1'b. The GCH picocode routes the frame back
to the CP by updating the TB field in the FCB with the value contained in
the TB field of the LAN_CP_Addr Register and instructing the EDS to
enqueue the frame in the multicast TB_SOF Queue. Frames residing in the
TB_SOF Queue are scheduled for forwarding by the EDS. The SDM builds the
Prizma Cell Header and Interface device Frame header from information
contained in the FCB. Cells from this frame pass through Prizma and are
reassembled into a frame on the down side of the CP's local blade. The
SDM of the down side recognizes that the frame is a Guided Frame and
signals the EDS to enqueue it in the GC Queue. This time when the GCH
picocode analyzes the frame from the D_DS, the resp/req field has a value
of `1'b. This implies that this blade is locally attached to the CP and
the Guided Frame is routed to the Port Queue corresponding to the CP.
Frames from this queue are transmitted from Interface device back to the
CP.
[0181] If, for any reason, the GCH picocode encounters a Guided Frame with
the TB field of the Frame Control information equal to `0000'h, then the
GCH picocode interprets the frame as intended for only this blade and acts
accordingly. This action is required during initialization when the value
of the My_TB Register is `0000'h for all blades. The CP will initialize
the My_TB Register of the locally attached blade by sending a Write Guided
Command in a Guided Frame whose Frame Control Information has a TB
value of `0000'h.
[0183] Any of the picoprocessors within the EPC can generate a Guided
Frame. This frame can be the Unsolicited Guided Frame or any other form
of Guided Frame. Internally generated frames of this type are constructed
in a way that does not allow acknowledgment (i.e. ack/noack=`0'b). These
frames may be sent to one of the two picoprocessors (GCH or GTH) within
the same EPC or to the GCH or GTH of some other blade.
[0184] Unsolicited Guided Frames may also be sent to the CP. Guided Frames
destined for the same EPC are constructed using data buffers in the D_DS.
These frames are then enqueued in the GC or GT Queue for processing.
These frames are then processed and discarded in the usual manner.
Unsolicited Guided Frames destined for the locally attached CP are
constructed using data buffers in the D_DS. These frames are constructed
in a way that indicates that they have been executed by the EPC (i.e.
resp/req=`1'b, and TB=My_TB). These frames are enqueued in the Port
Queue corresponding to the CP. Frames from this queue are transmitted
back to the CP.
[0185] Guided Frames destined for another blade can be constructed using
data buffers in the D_DS or the U_DS. Unsolicited Guided Frames destined
for the CP are constructed in a way that indicates that they have been
executed by the EPC (i.e. resp/req=`1'b, and TB=My_TB). Frames
constructed using buffers from the D_DS are enqueued to the Wrap Port.
These frames are moved to the U_DS and enqueued to the GC Queue on the up
side. Unsolicited Guided Frames with a resp/req value of `1'b will be
routed to the CP using the TB value in the LAN_CP_Addr Register. Otherwise,
the GCH picocode routes these frames using the TB value of the Frame
Control Information of the Guided Frame. At the receiving blade, the
frame is enqueued to the GC Queue of the down side. The GCH of this blade
executes and discards the frame (resp/req=`0'b and gch/gth=`1'), or
enqueues the frame to the GT Queue (resp/req=`0'b and gch/gth=`0'), or
enqueues the frame to the Port Queue corresponding to the CP
(resp/req=`1'b). Frames constructed using data buffers in the U_DS are
enqueued directly into the GC Queue of the up side. From this point
forward, these frames follow the same route and are handled in the same
way as those constructed using D_DS data buffers. FIG. 5 shows the
generalized format for guided frames.
[0186] The format shown is a logical representation with the most
significant byte on the left and the least significant byte on the right.
Four byte words begin with word 0 at the top and increase towards the
bottom of the page.
[0187] Since Guided Frames must be routed and processed before the
interface device has been configured by the CP, these frames must be
self-routing. The results normally obtained by look-up and classification
are contained in this Frame Control information field of the Guided Frame
allowing the chip to update the FCB with this information without
performing a look-up operation. The target blade information contained in
the Guided Frame is used by the Guided Frame Handler to prepare the Leaf
Page field of the FCB. The CP provides the Target Blade information while
the GCH picocode fills in the other fields in the FCB. This FCB
information is used by the SDM to prepare the Cell and Frame headers. The
format of the Frame Control information field of the Guided Frame is
shown in FIG. 6.
[0188] An explanation for the abbreviation at each bit position in FIG. 6
follows:
[0189] resp/req Response and Not Request indicator value. This field is
used to differentiate between request (unprocessed) and response Guided
Frames. 0 request 1 response
[0190] ack/noack Acknowledgment or No Acknowledgment control value. This
field is used to control whether (ack) or not (noack) the GCH picocode
acknowledges the Guided Frame. Guided Frames that are not to be
acknowledged shall not contain any form of Guided Command that performs a
read. 0 No Acknowledgment 1 Acknowledgment
[0191] early/late Early and Late Acknowledgment control value. This field
is used to control whether the acknowledgment requested (ack/noack=`1'b)
occurs before (early) or after (late) the Guided Frame has been
processed. This field is ignored when ack/noack=`0'b. 0 Acknowledge after
Guided Frame processing 1 Acknowledge before Guided Frame processing
[0192] neg/all Negative Acknowledgment or Acknowledge All control value.
This field is ignored when the ack/noack field has a value of `0'b unless
a guided command does not complete successfully. 0 Acknowledge all Guided
Frames if ack/noack=`1'b. Early or Late Acknowledgment determined by
value of early/late. 1 Acknowledge only Guided Frames that do not
complete successfully. This acknowledgment will occur independent of the
values of ack/noack and early/late and will of course be a late
acknowledgment.
[0193] up/down Up or Down control value. This value is used to control
whether the frame is processed on the up side or the down side. This
field is ignored when resp/req is `1'b. All multicast Guided Frames shall
have an up/down value of `0'b. In addition, Guided Commands that require
the use of GTH hardware assist instructions shall have an up/down value
of `0'b. 0 Down side processing 1 Up side processing
[0194] gth/gch General Tree Handler or Guided Cell Handler control value.
This value is used to direct Guided Frames to the proper picoprocessor. 0
GCH picoprocessor 1 GTH picoprocessor
[0195] TB Target Blade value. When resp/req is `0'b, this field contains
routing information used by Prizma. Each bit position corresponds to a
Target Blade. If this value is `0000'h, then the Guided Frame is assumed
to be for this blade and is executed accordingly. A value of `1'b in one
or more bit positions of the TB field indicates that the cell is routed
to the corresponding Target Blade(s). When resp/req is `1'b, the field
contains the My_TB value of the responding blade.
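The interaction of the ack/noack, early/late and neg/all fields can be sketched as follows. This is only an illustrative reading of the field descriptions above, not the picocode itself. The class name FrameControl, the function ack_timing and the flag all_commands_ok are hypothetical, and the bit positions of FIG. 6 are deliberately not modeled.

```python
from dataclasses import dataclass


@dataclass
class FrameControl:
    """Hypothetical container for the Frame Control fields (layout of FIG. 6 not modeled)."""
    resp_req: int    # 0 = request, 1 = response
    ack_noack: int   # 0 = no acknowledgment, 1 = acknowledgment
    early_late: int  # 0 = acknowledge after processing, 1 = before processing
    neg_all: int     # 0 = acknowledge all, 1 = acknowledge only failures
    up_down: int     # 0 = down side processing, 1 = up side processing
    gth_gch: int     # 0 = GCH picoprocessor, 1 = GTH picoprocessor
    tb: int          # Target Blade bit mask; 0x0000 means "this blade"


def ack_timing(fc: FrameControl, all_commands_ok: bool) -> str:
    """Return 'early', 'late' or 'none' for a request frame.

    Assumption: when neg/all is 1, only unsuccessful frames are acknowledged,
    always late, regardless of ack/noack and early/late.
    """
    if fc.neg_all == 1:
        return "none" if all_commands_ok else "late"
    if fc.ack_noack == 0:
        return "none"
    return "early" if fc.early_late == 1 else "late"
```

For example, a frame with ack/noack=1 and early/late=1 yields "early", while a frame with neg/all=1 is acknowledged (late) only if one of its Guided Commands fails.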
[0196] Word 1 of the Guided Frame contains a correlator value (FIG. 7).
This value is assigned by the CP software to correlate Guided Frame
responses with their requests. The Correlator includes a plurality of
bits with assigned functions.
[0197] Every Guided Command begins with a Command Control Information
field. This Command Control contains information that aids the GCH
picocode in processing a Guided Frame. The format for this information is
shown in FIG. 8.
[0198] Length value: This value indicates the total number of 32 bit words
contained in the Control Information (Cmd Word 0), the Address
Information (Cmd Word 1), and Operand (Cmd Words 2+) portions of the
Guided Frame.
[0199] Completion Code value: This field is initialized by the CP and is
modified by the GCH picocode when processing Guided Commands. The GCH
picocode uses this field for completion status for the preceding Guided
Command in the command list. Since all Guided Command lists terminate
with the End Delimiter Guided Command, the completion status of the last
command is contained in the End Delimiter's Completion Code field.
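The way completion status propagates through a command list can be illustrated with a short sketch. It assumes a command is represented as a dictionary with hypothetical keys "type" and "completion_code", and that a caller-supplied function execute returns an integer status; neither the representation nor the status values come from this description.

```python
def run_command_list(commands, execute):
    """Process Guided Commands in order, storing each command's status
    in the Completion Code field of the NEXT command in the list.

    commands: list of dicts with keys 'type' and 'completion_code'
              (hypothetical representation); must end with End_Delimiter.
    execute:  callable taking a command dict and returning an int status.
    """
    if not commands or commands[-1]["type"] != "End_Delimiter":
        raise ValueError("command list must terminate with End_Delimiter")
    for i, cmd in enumerate(commands):
        if cmd["type"] == "End_Delimiter":
            break  # the delimiter itself is not executed
        # Status of the current command lands in the following command.
        commands[i + 1]["completion_code"] = execute(cmd)
    return commands
```

With this arrangement the status of the last executable command ends up in the End Delimiter, which matches the statement above.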
[0200] Guided Command type value (Symbolic Name)
Symbolic Name        Type Value  Type Description
End_Delimiter        0000        mark the end of a Guided Frame sequence
Build_TSE_Free_List  0001        build a free list
Software_Action      0010        execute software action
Unsolicited          0011        frames initiated by the EPC picocode
Block_Write          0100        write a block of data to consecutive addresses
Duplicate_Write      0101        write duplicate data to registers or memory
Read                 0110        request and respond for reading register or memory data
                     0111        reserved
Insert_Leaf          1000        insert a leaf into the search tree
Update_Leaf          1001        update a leaf of the search tree
Read_Leaf            1010        request and respond for reading of Leaf Page data
                     1011        reserved
Delete_Leaf          1100        delete a leaf of the search tree
                     1101-1111   reserved
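The type encodings listed above can be captured as an enumeration. The following is a minimal sketch; the names GuidedCommandType, RESERVED_TYPES and decode_command_type are hypothetical, while the 4-bit values are those given in the list.

```python
from enum import IntEnum


class GuidedCommandType(IntEnum):
    """4-bit Guided Command type values as listed in the description."""
    END_DELIMITER = 0b0000
    BUILD_TSE_FREE_LIST = 0b0001
    SOFTWARE_ACTION = 0b0010
    UNSOLICITED = 0b0011
    BLOCK_WRITE = 0b0100
    DUPLICATE_WRITE = 0b0101
    READ = 0b0110
    INSERT_LEAF = 0b1000
    UPDATE_LEAF = 0b1001
    READ_LEAF = 0b1010
    DELETE_LEAF = 0b1100


# Encodings marked "reserved" in the list.
RESERVED_TYPES = {0b0111, 0b1011, 0b1101, 0b1110, 0b1111}


def decode_command_type(value: int) -> GuidedCommandType:
    """Map a 4-bit type value to its symbolic name; reject reserved values."""
    if value in RESERVED_TYPES:
        raise ValueError(f"reserved Guided Command type: {value:04b}")
    return GuidedCommandType(value)  # raises ValueError if out of range
```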
[0201] The addressing information contained in the Guided Frame identifies
an element within the Networking Processor's addressing scheme. The
general form for the Address Information field is shown in FIG. 9.
[0202] The Interface device employs a 32 bit addressing scheme. This
addressing scheme assigns an address value to every accessible structure
of the Interface device. These structures are either internal to the
Processor or connected to interfaces under the control of the Processor.
Some of these structures are accessed by the Embedded Processor Complex
(EPC) via an internal interface called the Web Interface. The remainder
of the structures are accessed via memory controller interfaces. In all
cases the general form of the address is shown in FIG. 10.
[0203] The Network Controller is subdivided into major chip islands. Each
island is given a unique Island ID value. This 5 bit Island ID value
forms the 5 most significant bits of the address for structures
controlled by that chip island. The correspondence between encoded Island
ID value and the chip island name is shown in FIG. 11. The second portion
of the Web address consists of the next most significant 23 bits. This
address field is segmented into a structure address portion and an
element address portion. The number of bits used for each segment may
vary from island to island. Some islands may contain only a few large
structures while others may contain many small structures. For that
reason there is no fixed size for these address segments. The structure
address portion is used to address an array within the island while the
element address portion is used to address an element within the array.
The remaining portion of the address is to accommodate the Web
Interface's 32 bit data bus limitation. This 4 bit word address is used
for selecting 32 bit segments of the addressed element. This is
necessary for moving structure elements wider than 32 bits across the
Network Controller's Web Data Bus. Word address value `0'h refers to the
32 most significant bits of the structure element while sequential word
address values correspond to successively less significant segments of
the structure element. The word address portion of the address is not
required for structures not accessed via the Web Interface. For this
reason, the Up Data Store, Control Memories, and Down Data Store make use
of the entire 27 least significant bits of address to access structure
elements. Another exception to this format is the address for the SPM
Interface. In that case all 27 bits of address are used and no element is
greater than 32 bits in width.
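The general form of a Web address described above (a 5 bit Island ID, a 23 bit structure/element portion and a 4 bit word address) can be decoded as sketched below. The function name split_web_address is hypothetical, and the split of the 23 bit portion into structure and element segments is left undivided because it varies from island to island.

```python
def split_web_address(addr: int):
    """Split a 32-bit Web address into (island_id, struct_elem, word_addr).

    island_id   : 5 most significant bits
    struct_elem : next 23 bits (structure + element address; the split
                  between the two varies per island and is not modeled)
    word_addr   : 4 least significant bits, selecting a 32-bit segment
    """
    if not 0 <= addr < 1 << 32:
        raise ValueError("address must fit in 32 bits")
    island_id = addr >> 27
    struct_elem = (addr >> 4) & ((1 << 23) - 1)
    word_addr = addr & 0xF
    return island_id, struct_elem, word_addr
```

For structures not accessed via the Web Interface, the lower 27 bits would instead be treated as a single field, as the paragraph above notes.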
[0204] The Embedded Processing Complex (EPC) provides and controls the
programmability of the Interface device Chip. It includes the following
components (see also FIG. 12A):
[0205] N processing units, called GxH: The GxHs concurrently execute
picocode that is stored in a common Instruction Memory. Each GxH consists
of a Processing Unit core, called CLP, which contains a 3-stage pipeline,
16 GPRs and an ALU. Each GxH also contains several coprocessors, like for
example the Tree Search Engine. The GxH is disclosed separately.
[0206] Instruction Memory: Is loaded during initialization and contains the
pico-code for forwarding frames and managing the system.
[0207] A Dispatcher: Dequeues frame-addresses from the up and down
dispatcher queues. After dequeue, the dispatcher pre-fetches part of the
frame-header from the up or down DataStore (DS) and stores this in an
internal memory. As soon as a GxH becomes idle, the Dispatcher passes the
frame header with appropriate control information, like the Code
Instruction Address (CIA) to the GxH. The dispatcher also handles timers
and interrupts.
[0208] A Tree Search Memory (TSM) Arbiter: There are a number of shared
internal and external memory locations available to each GxH. Since this
memory is shared an arbiter is used to control access to the memory. The
TSM can be accessed directly by the picocode, which can for example be
used to store aging tables in the TSM. Also, the TSM will be accessed by
the TSE during tree searches.
[0209] The Completion Unit (CU): The Completion Unit performs two
functions. First, it interfaces the N Processing Units to the UP and Dn
EDS (Enqueue, Dequeue and Schedule Island). The EDS performs the enqueue
action: a frame address, together with appropriate parameters called the
FCBPage, is queued in either a transmission queue, a discard queue, or a
dispatcher queue. Second, the Completion Unit guarantees frame sequence.
Since it may happen that multiple GxHs are processing frames that belong
to the same flow, precautions must be taken that these frames are enqueued in
the up or dn transmission queues in the right order. The Completion Unit
uses a label that is generated by the Classifier Hardware Assist upon
frame dispatch.
[0210] Classifier Hardware Assist: For up-frames, the Classifier Hardware
Assist provides a classification for well known cases of frame formats.
Classification results are passed to the GxH, during frame dispatch, in
terms of the CIA and contents of one or more registers. For dn-frames,
the Classifier Hardware Assist determines the CIA, depending on the frame
header. For both up and dn frame dispatches, the Classifier Hardware
Assist generates a label that is used by the Completion Unit to maintain
frame sequence.
[0211] Up and dn DataStore Interface and arbiter: Each GxH has access to
the up and dn DataStore: read access is provided when reading "more Fish"
and write access is provided when writing back the contents of the
FishPool to the DataStore. Since there are N Processing Units, and only
one of them at a time can access the up DataStore and one at a time can
access the dn DataStore, one arbiter for each DataStore is required.
[0212] WEB Arbiter and WEBWatch interface: The WEB Arbiter arbitrates
among the GxHs for access to the WEB. All GxHs have access to the WEB,
which allows access to all memory and register functions in Interface
device. This allows any GxH to modify or read all configuration areas.
The WEB can be thought of as the Interface device memory map.
[0213] The WEBWatch interface provides access to the entire WEB from
outside the chip using 3 chip-IOs.
[0214] Debug, Interrupts and Single Step Control: The WEB allows the GCH
or WEBWatch to control each GxH on the chip when necessary. For example,
the WEB can be used by the GCH or WEBWatch to single step instructions on
a GxH.
[0215] An embedded general purpose processor, like a PowerPC.
[0216] There are three types of GxH (FIG. 12B):
[0217] GDH (General Data Handler). There are eight GDHs. Each GDH has a
full CLP with the five coprocessors (which are described in the next
section). The GDHs are mainly used for forwarding frames.
[0218] GCH (Guided Cell Handler). The GCH has exactly the same hardware as
a GDH. However, a guided frame can only be processed by the GCH. It is
programmable on the WEB (CLP_Ena register) if the GCH is enabled to also
process dataframes (in which case it takes the role of a GDH). The GCH
has additional hardware compared to the GDH: hardware assist to perform
tree inserts and deletes. The GCH is used to execute guided-cell related
picocode, perform chip and tree management related picocode like aging
and to exchange control information with the CP and/or another GCH. When
there is no such task to perform the GCH will execute frame forwarding
related picocode, and in this case behaves exactly like a GDH.
[0219] GTH (General Processor Tree Handler). The GTH has access to the
hardware mailbox that connects to the PowerPC. The GTH has additional
hardware assist to perform tree inserts, tree deletes and rope
management. The GTH will process dataframes when there are no frames
(containing tree management commands) in the GPQ.
[0220] The number of GxHs (ten) is a "best guess." Performance evaluation
will determine how many GxHs are really required. The architecture and
structure is completely scalable towards more GxHs and the only
limitation is the amount of silicon area (which should then also include
a larger arbiter and instruction memory).
[0221] Each GxH is structured as shown in FIG. 12C. In addition to the CLP
with General Purpose Registers (GPR) and Arithmetic Logic Unit (ALU),
each GxH contains the following five coprocessors:
[0222] (DS) Coprocessor Interface. Interfaces to the Dispatcher and to the
sub-islands that provide read and write access to the up and dn
DataStores. The DS Interface contains the so called FishPool.
[0223] The Tree Search Engine Coprocessor (TSE). The TSE performs searches
in the trees, and also interfaces to the Tree Search Memory (TSM).
[0224] Enqueue Coprocessor. Interfaces the Completion Unit Interface and
contains the FCBPage. This Coprocessor contains a 256-bit register with
additional hardware assist that the picocode must use to build the
FCBPage, which contains the enqueue parameters. Once the FCBPage is built,
the picoprocessor can execute an enqueue instruction, which causes this
coprocessor to forward the FCBPage to the Completion Unit.
[0225] WEB Interface Coprocessor. This coprocessor provides an interface to
the WEB Arbiter and allows reading and writing to/from the Interface
device WEB.
[0226] Checksum Coprocessor. Generates checksums on frames stored in the
Fishpool (described hereinafter).
[0227] The Processing Units are shared between ingress processing and
egress processing. It is programmable how much bandwidth is reserved for
ingress processing versus egress processing. In the current
implementation, there are two modes: 50/50 (i.e. ingress and egress get
the same bandwidth) or 66/34 (i.e. ingress gets twice as much bandwidth
as egress).
[0228] Operation of the Processing Units is event-driven. That is, frame
arrival is treated as an event, as well as popping of a timer or an
interrupt. The dispatcher treats different events in an identical
fashion, though there is a priority (first interrupt, then timer-events
and finally frame arrival events). When an event is handed to a
Processing Unit, appropriate information is given to the Processing Unit.
For frame arrival events, this includes part of the frame header, and
information coming from the hardware classifier. For timer and
interrupts, this includes the code entry point and other information that
relates to the event.
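The event priority just described (interrupts first, then timer events, then frame arrival events) can be sketched as a small dispatcher model. The class name EventDispatcher and the string event kinds are hypothetical; the real Dispatcher is hardware and also pre-fetches frame headers, which is not modeled here.

```python
from collections import deque

# Priority order stated in the description: interrupt, timer, frame arrival.
PRIORITY = ("interrupt", "timer", "frame")


class EventDispatcher:
    """Toy model: one FIFO per event kind, served in fixed priority order."""

    def __init__(self):
        self.queues = {kind: deque() for kind in PRIORITY}

    def post(self, kind, info):
        """Queue an event together with the information handed to the unit."""
        self.queues[kind].append(info)

    def next_event(self):
        """Return (kind, info) for the highest-priority pending event, or None."""
        for kind in PRIORITY:
            if self.queues[kind]:
                return kind, self.queues[kind].popleft()
        return None
```

Apart from the priority order, all event kinds pass through the same code path, which reflects the statement that the dispatcher treats different events in an identical fashion.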
[0229] When a frame arrives on the ingress side, and the number of
received bytes of this frame has exceeded a programmable threshold, the
address of the frame-control-block is written in a GQ.
[0230] When a complete frame has been re-assembled on the egress side, the
frame address is written in a GQ. There are four types of GQ's (and for
each type, FIG. 12B, there is an ingress version and an egress version):
[0231] GCQ: contains frames that must be processed by the GCH.
[0232] GTQ: contains frames that must be processed by the GTH.
[0233] GPQ: contains frames that must be processed by the GPH.
[0234] GDQ: contains frames that can be processed by any GDH (or GCH/GTH
when they are enabled to process dataframes). For the GDQ, there are
multiple priorities, whereby frames enqueued in a higher priority GDQ
will be processed before frames enqueued in a lower priority queue.
[0235] Some Processing Units may be specialized. In the current
implementation, there are four types of Processing Units (GxH) (see also
FIG. 12B):
[0236] GDH (General Data Handler). The GDHs are mainly used for forwarding
frames.
[0237] GCH (Guided Cell Handler). The GCH has exactly the same hardware as
GDH. However, a guided frame can only be processed by the GCH. It is
programmable on the WEB (CLP_Ena register) if the GCH is enabled to also
process dataframes (in which case it takes the role of a GDH).
[0238] GTH (General Tree Handler). The GTH has additional hardware
compared to the GDH/GCH: hardware assist to perform tree inserts, tree
deletes and rope management. The GTH will process dataframes when there
are no frames (containing tree management commands) in the GPQ.
[0239] GPH (General PowerPC Handler). The GPH has additional hardware
compared to the GDH/GTH. The GPH interfaces to the embedded PowerPC by
means of a mail-box interface.
[0240] In an actual implementation, the role of GCH, GTH and GPH can be
implemented on a single Processing Unit. For example one implementation
could have one Processing Unit for GCH and GPH. A similar comment holds
for the GCQ, GTQ and GPQ.
[0241] The purpose of the Datastore Coprocessor is:
[0242] To interface to the Up DataStore, which contains frames that have
been received from the media, and the Down DataStore, which contains
reassembled frames received from the Prizma Atlantic.
[0243] The Datastore Coprocessor also receives configuration information
during the dispatch of a timer event or interrupt.
[0244] The Datastore Coprocessor is able to calculate checksums on frames.
[0245] The Datastore Coprocessor contains a FishPool (that can hold 8
fish), a ScratchMem (that can hold 8 fish) and some control registers to
read/write FishPool contents from/to the up or down datastore. The
FishPool can be seen as some kind of work area for the Datastore: instead
of reading/writing directly to a Datastore, a larger amount of frame data
is read from the Datastore into the Fishpool or a larger amount of data
is written from the Fishpool into the Datastore. The unit of transfer is
a Fish, which equals 16 Bytes.
[0246] The Fishpool can be seen as a memory that can contain 8 fish, that
is 8 words of 128 bit each. In the CLP processor architecture, the
Fishpool is a register array of 128 bytes. Each byte in the Fishpool has
a 7-bit byte address (0 . . . 127) and access is on a 16-bit or 32-bit
basis. Like all register arrays, the Fishpool has a circular addressing
scheme. That is, addressing a word (i.e. four bytes) starting at location
126 in the Fishpool returns bytes 126, 127, 0 and 1. Furthermore, from a
Datastore Coprocessor point of view, fish-locations in the Fishpool have
a 3-bit fish-address.
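The circular addressing of the Fishpool can be illustrated as follows. This sketch models the Fishpool as a 128-byte buffer; the names read_bytes and fish_address are hypothetical, and deriving the 3-bit fish address from the upper bits of the 7-bit byte address is an assumption based on the 16-byte fish size.

```python
FISHPOOL_BYTES = 128  # 8 fish
FISH_BYTES = 16       # one fish = 16 bytes = 128 bits


def read_bytes(pool: bytes, start: int, count: int) -> bytes:
    """Read `count` bytes starting at byte address `start`, wrapping at 128."""
    if len(pool) != FISHPOOL_BYTES:
        raise ValueError("Fishpool must be exactly 128 bytes")
    return bytes(pool[(start + i) % FISHPOOL_BYTES] for i in range(count))


def fish_address(byte_address: int) -> int:
    """3-bit fish address containing the given 7-bit byte address (assumed mapping)."""
    return (byte_address % FISHPOOL_BYTES) // FISH_BYTES
```

Reading a four-byte word at location 126 of a pool whose bytes hold their own addresses returns bytes 126, 127, 0 and 1, as in the example given above.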
[0247] Upon frame dispatch the first N fish of a frame are automatically
copied in the Fishpool by the Dispatcher. The value of N is programmable
in the PortConfigMemory. Typically, N equals four for up frame dispatch,
2 for dn unicast frame dispatch, 4 for dn multicast frame dispatch and 0
for interrupts and timers.
[0248] The picocode can read more bytes from a frame, in which case the
Datastore Coprocessor automatically reads the frame data into the
fishpool at the next fish address, wrapping automatically to 0 when the
boundary of the Fishpool has been reached. Also, the picocode can read or
write the up/down datastore at an absolute address.
[0249] The WEB Coprocessor interfaces to the EPC WEB Arbiter. The EPC WEB
Arbiter arbitrates among the ten GxH and the WEB Watch to become a master
on the Interface device WEB interface. This allows all GxH to read and
write on the WEB.
[0250] The interface device memory complex provides storage facilities for
the Embedded Processing Complex (EPC) FIG. 12A. The memory complex
includes the Tree-Search Memory (TSM) Arbiter and a plurality of on-chip
and off-chip memories. The memories store tree structures, counters and
anything else that the pico code requires memory access for. Furthermore,
the memories are used to store data structures that are used by the
hardware, like free lists, queue-control-blocks, etc. Any memory location
which is not allocated for trees or which is not used by the hardware is
by default available for pico code use, like counters and aging tables.
[0251] FIG. 13 shows a more detailed block diagram of the memory complex.
The tree-search memory (TSM) arbiter provides the communication link
between the Embedded Processors (GxH) and the memories. The memories
include 5 on-chip SRAMs, 1 off-chip SRAM, and 7 off-chip DRAMS. The TSM
Arbiter includes ten Request Control Units (each one connected to one of
the Embedded Processor GxH) and 13 memory arbiter units, one for each
memory. A bus structure interconnects the Request Control Units and the
arbiter units in such a way that each control unit and its connected GxH
have access to all memories.
[0252] The control unit includes necessary hardware to steer data between
the Embedded Processor (GxH) and the arbiters.
[0253] The SRAM arbiter units, among other things, manage the flow of data
between the Embedded Processor GxH and the on-chip and off-chip SRAMs.
[0254] The DRAM Arbiter Units, among other things, manage the flow of
data between the Embedded Processor (GxH) and the off-chip DRAM devices.
[0255] Each Memory Arbiter contains a "back-door" access, which is
typically used by other parts of the chip and has highest access
priority.
[0256] The DRAM Memories can run in two modes of operation:
[0257] TDM-mode. Memory access to the four banks in the DDRAM is done
alternating read-"windows" and write-windows, whereby in a read window,
access to any of the four banks is read-only and in a write window,
access to any of the four banks is write only. Using TDM-mode for
multiple DDRAMs allows some control signals to be shared between the
DDRAMs and hence saves some chip IOs (which are a very scarce resource).
[0258] Non-TDM-mode. Memory access to the four banks in the DDRAM can be a
combination of read and write (which must follow some rules described in
the DDRAM Arbiter Disclosure). E.g., one can do a read in bank A and a
write in bank C within an access window.
[0259] Allows N Requesters simultaneous access to M memories. When
multiple Requesters want to access the same memory, a round-robin
arbitration is performed.
[0260] The M memories can have different properties. In our current
implementation, there are three memory types: internal SRAM, external
SRAM and external DDRAM.
[0261] The M memories and N Requesters are homogeneous: any Requester can
access any memory.
[0262] Some memories are logically divided into multiple sub-memories
(like four banks in the DDRAM), which can be logically accessed
simultaneously.
[0263] Part of the M memories are used for control memories containing
internally used data structures, which have a high priority access
compared to the picoprocessors. This also allows debugging of the chip,
since the picoprocessors can read the contents of the control memories.
[0264] The arbiter supports read access, write access and read-add-write,
whereby an N-bit integer is added to the contents of the memory in an
atomic operation.
[0265] A general address scheme is used to access the M memories, such
that the physical location of an object in the memory is transparent.
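The round-robin arbitration mentioned in the list above can be sketched for a single memory as follows. The class RoundRobinArbiter is hypothetical and models only the rotation among Requesters; the high-priority "back-door" access and the read-add-write operation are not modeled.

```python
class RoundRobinArbiter:
    """Toy round-robin arbiter for one memory shared by N Requesters."""

    def __init__(self, num_requesters: int):
        self.n = num_requesters
        self.last = num_requesters - 1  # so that Requester 0 is served first

    def grant(self, requesting):
        """Grant one Requester from the set `requesting`; return its id or None.

        The search starts just after the most recently granted Requester,
        so no Requester can be starved by the others.
        """
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if candidate in requesting:
                self.last = candidate
                return candidate
        return None
```

With M memories, one such arbiter per memory would allow different Requesters to be granted different memories in the same cycle, which is one way to read the statement that N Requesters have simultaneous access to M memories.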
[0266] The concept of trees is used by the Tree Search Engine to store and
retrieve information. Retrieval, i.e., tree-searches and also inserts and
deletes are done based on a Key, which is a bit-pattern like, for
example, a MAC source address, or the concatenation of an IP source
address and IP destination address. Information is stored in a control
block called Leaf, which contains at least the Key (as will be seen
later, the stored bit pattern is actually the hashed Key). A leaf can
also contain additional information, like aging information, or user
information, which can for example be forwarding information like target
blade and target port numbers.
[0267] There are three tree types (FM, LPM and SMT) and associated tree type
searches, namely: fixed match, largest prefix match and software managed
tree. An optional additional criterion for checking the leaf during a
tree search is the VectorMask. Roping, aging and a latch are used to
increase search performance.
[0268] The search algorithm for FM trees is shown in FIG. 14. The search
algorithm operates on input parameters, which include the Key, performs a
hash on the Key, accesses a Direct Table (DT), walks the tree through
Pattern Search Control Blocks (PSCBs) and ends up at a Leaf (FIG. 14).
There are three types of trees, each with its own search algorithm, which
causes the tree-walk to occur according to different rules. For example,
for Fixed Match (FM) trees, the datastructure is a Patricia Tree. When a
Leaf has been found, this Leaf is the only possible candidate that can
match the input Key. For Software Managed Trees, there can be multiple
Leafs that are chained in a linked list. In this case, all Leafs in the
chain are checked with the input Key, until a match has been found or
until the chain has been exhausted. A so-called "compare at the end"
operation, which compares the input Key with the pattern stored in the
Leaf, verifies if the Leaf really matches the input Key. The result of
the search will be OK when the Leaf has been found and a match has
occurred, or KO in all other cases.
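The overall shape of a search, ending in the "compare at the end" operation, can be sketched as below. This is a strongly simplified model: the PSCB tree walk is omitted, each Direct Table entry is assumed to point directly at a chain of Leafs (as for a Software Managed Tree), and indexing the Direct Table with the highest bits of the hashed Key is an assumption. The names tree_search, direct_table and dt_bits are hypothetical.

```python
def tree_search(key, direct_table, dt_bits, hash_fn=lambda k: k, key_bits=128):
    """Return ('OK', leaf) when a Leaf matches the hashed Key, else ('KO', None).

    direct_table: list of 2**dt_bits chains; each chain is a list of Leafs,
                  and each Leaf is a dict holding at least 'pattern'.
    hash_fn:      hash applied to the Key; identity models the "no hash" case.
    """
    hashed_key = hash_fn(key)
    index = hashed_key >> (key_bits - dt_bits)  # assumed: highest bits index the DT
    for leaf in direct_table[index]:
        # "Compare at the end": verify the stored pattern against the hashed Key.
        if leaf["pattern"] == hashed_key:
            return "OK", leaf
    return "KO", None
```

For a Fixed Match tree the chain would contain at most one candidate Leaf, so the loop degenerates to a single comparison.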
[0269] The input to a search operation consists of the following
parameters:
[0270] Key (128 bits). The Key must be built using special picocode
instructions prior to the search (or insert/delete). There is only one
Key register. However, after the tree search has started, the Key
register can be used by the picocode to build the key for the next
search, concurrently with the TSE performing the search. This is because
the TSE hashes the Key and stores the result in an internal HashedKey
register (thus, in reality, there are 2 Key registers).
[0271] KeyLength (7 bits). This register contains the length of the Key in
bits. It is automatically updated by hardware during building of the Key.
[0272] LUDefIndex (8 bits). This is an index into the LUDefTable, which
contains a full definition of the tree in which the search occurs. The
LUDefTable is described in detail later.
[0273] TSRNr (1 bit). The search results can be stored either in Tree
Search Result Area 0 (TSR0) or TSR1. This is specified by TSRNr. While
the TSE is searching, the picocode can access the other TSR to analyze
the results of a previous search.
[0274] VectorIndex (6 bits). For trees which have the VectorMask enabled
(which is specified in the LUDefTable), the VectorIndex denotes a bit in
the VectorMask. At the end of the search, the value of this bit is
returned and can be used by picocode.
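The search input parameters listed above can be summarized as a small record with the stated field widths. The following is a sketch only; the class name `SearchInput`, the Python field names, and the range check are illustrative assumptions and not part of the described hardware.

```python
from dataclasses import dataclass

# Field widths in bits, taken from the parameter list above.
FIELD_WIDTHS = {
    "key": 128,
    "key_length": 7,
    "lu_def_index": 8,
    "tsr_nr": 1,
    "vector_index": 6,
}


@dataclass(frozen=True)
class SearchInput:
    key: int           # Key register contents
    key_length: int    # KeyLength
    lu_def_index: int  # LUDefIndex, index into the LUDefTable
    tsr_nr: int        # TSRNr, selects TSR0 or TSR1
    vector_index: int  # VectorIndex, selects a bit in the VectorMask

    def __post_init__(self) -> None:
        # Reject values that do not fit in the register width of the field.
        for name, width in FIELD_WIDTHS.items():
            value = getattr(self, name)
            if not 0 <= value < (1 << width):
                raise ValueError(f"{name} does not fit in {width} bits")
```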
[0275] The input Key will be hashed into a HashedKey, as shown in FIG. 14.
There are six fixed hash algorithms available (one "algorithm" performs
no hash function). It is specified in the LUDefTable which algorithm will
be used. A programmable hash function may be used to add flexibility.
[0276] The output of the hash function is always a 128-bit number, which
has the property that there is a one-to-one correspondence between the
original input Key and the output of the hash function. As will be
explained below, this property minimizes the depth of the tree that
starts after the Direct Table.
[0277] If colors are enabled for the tree, which is the case in the
example of FIG. 14, the 16-bit color register is inserted in the 128-bit
hash function output. The insertion occurs directly after the Direct
Table. I.e., if the Direct Table contains 2.sup.N entries, then the
16-bit color value is inserted at bit position N, as shown in the
figure. The output of the hash function, together with the inserted color
value (when enabled), is stored in the HashedKey register.
[0278] The hash function is defined such that most entropy in its output
resides in the highest bits. The N highest bits of the HashedKey register
are used to calculate an index into the Direct Table (DT).
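The color insertion and the Direct Table indexing can be sketched with plain integer arithmetic. The sketch assumes that bits are numbered from the most significant end, that the 16-bit color lands immediately after the N index bits, and that the resulting HashedKey is therefore 144 bits wide when colors are enabled; the function names `insert_color` and `dt_index` are hypothetical.

```python
HASH_BITS = 128   # width of the hash function output
COLOR_BITS = 16   # width of the color register


def insert_color(hash_out: int, color: int, n: int) -> int:
    """Insert the color directly after the N highest bits of the hash output."""
    low_bits = HASH_BITS - n
    high = hash_out >> low_bits               # the N Direct Table index bits
    low = hash_out & ((1 << low_bits) - 1)    # the remaining 128 - N bits
    return (high << (COLOR_BITS + low_bits)) | (color << low_bits) | low


def dt_index(hashed_key: int, n: int, colors_enabled: bool) -> int:
    """Take the N highest bits of the HashedKey as the Direct Table index."""
    total_bits = HASH_BITS + (COLOR_BITS if colors_enabled else 0)
    return hashed_key >> (total_bits - n)
```

Because the color is placed below the index bits, the Direct Table index is the same with or without colors in this sketch; the color only affects the tree walk that follows the DTEntry.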
[0279] The search starts with an access into the Direct Table: a DTEntry
is read from the direct table. The address used to read the DTEntry is
calculated from the N highest bits of the HashedKey, as well as from
tree properties as defined in the LUDefTable. This is explained in detail
below. The DTEntry can be seen as the root of a tree. The particular tree
datastructure that is used depends on the tree type. At this point it
suffices to say that a Patricia Tree datastructure is used for FM trees,
and extensions to Patricia Trees for LPM and SMT trees.
[0280] An example of the use of an 8 entry DT is shown in FIG. 15. It can
be seen that the search time (i.e., the number of PSCBs that must be
accessed) can be reduced by using a DT. Thus, by increasing the DT size,
a trade-off can be made between memory usage and search performance.
[0281] As can be seen from FIG. 15, a DTEntry can contain the following
information:
[0282] Empty. There are no Leafs attached to this DTEntry.
[0283] A pointer to a Leaf. There is a single Leaf attached to this
DTEntry.
[0284] A pointer to a PSCB. There is more than one Leaf attached to this
DTEntry.
[0285] The DTEntry defines the root of a tree.
[0286] The search algorithm for a software managed tree and the algorithm
for generating the tree are set forth in U.S. patent application Ser. No.
09/312,148, which is incorporated herein by reference.
[0287] An algorithm termed "Choice Bit Algorithm" uses a certain metric to
build a binary search tree based upon bits selected from items termed
"rules" in a set or universe of rules. All our examples are couched in
terms of Internet Protocol (IP) headers, but a fixed format header of any
type could be used instead.
[0288] In IP, each Rule pertains to certain Keys which might be built with
the following subsections: Source Address (SA), Destination Address (DA),
Source Port (SP), Destination Port (DP), and Protocol (P). These data are
respectively 32, 32, 16, 16, and 8 bits long and so a Key to be tested
consists of 104 bits. The Choice Bit Algorithm finds certain of the 104
bits which are especially useful. Testing these few bits in effect
eliminates all but one or all but a few rules from possible application.
For some rules, testing inequalities by means of simple compare
operations is also appropriate. The bit tests and compares are logically
organized in a binary tree. The tree is mapped into a hardware enabled
structure that tests bits at high speeds. Such testing results in just
one rule or a small number of rules (called a leaf chain) which the Key
might fit. In the former case, the Key is then tested in full by the
rule. In the latter case, the Key is then tested in a lattice of tests
using compares and full rule tests.
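The composition of the 104-bit Key from the five header fields can be sketched as follows. The field order (SA, DA, SP, DP, P from most to least significant) is an assumption taken from the order in which the fields are listed, and the function name `build_key` is hypothetical.

```python
import ipaddress


def build_key(sa: str, da: str, sp: int, dp: int, proto: int) -> int:
    """Pack SA (32), DA (32), SP (16), DP (16) and P (8) into a 104-bit Key."""
    if not (0 <= sp < 1 << 16 and 0 <= dp < 1 << 16 and 0 <= proto < 1 << 8):
        raise ValueError("port or protocol value out of range")
    key = int(ipaddress.IPv4Address(sa))
    key = (key << 32) | int(ipaddress.IPv4Address(da))
    key = (key << 16) | sp
    key = (key << 16) | dp
    key = (key << 8) | proto
    return key
```

With this layout, a bit test on the Key reduces to a shift and mask, which is the kind of operation the tree of bit tests relies on.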
[0289] Each rule in the rule set is associated with an action which is
taken if the rule is the highest priority rule which fits the key.
Rules can intersect (one key fits two or more rules).
[0290] In that case, rules can be given priority numbers 1, 2, 3, . . . ,
so that any two intersecting rules have different priorities (an
administrator must declare which rule dominates if a key fits two or
more). Thus if more than one rule remains to be tested after the bit
tests and compares, the rules are tested in order of priority. A lower
priority number designates a rule with higher priority.
[0291] If no fit is found at all, some default provision may be specified.
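The priority handling described above can be sketched as follows, assuming each remaining candidate rule carries a priority number, a full-rule test, and an action. The names `Rule` and `resolve` are hypothetical, and the predicate form of the rule test is an illustrative simplification.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class Rule(NamedTuple):
    priority: int               # lower number means higher priority
    fits: Callable[[int], bool]  # full rule test applied to the Key
    action: str                 # action taken if this rule is selected


def resolve(candidates: Iterable[Rule], key: int,
            default: Optional[str] = None) -> Optional[str]:
    """Test candidate rules in order of priority; return the first fit's action."""
    for rule in sorted(candidates, key=lambda r: r.priority):
        if rule.fits(key):
            return rule.action
    # No rule fits: fall back to the default provision, if one was specified.
    return default
```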
[0292] The search algorithm for the longest Prefix Matching method is set
forth in U.S. Pat. No. 5,787,430, incorporated herein by reference. The
method requires entering at a node of said database (root node);
determining a search path from one node to another through said tree-like
database by successively processing segments of said search argument
which comprise only those parts of the entries which are necessary to
identify the next (child) node, and said second link information until
said segments are consumed or a (leaf) node lacking said second link
information is reached; comparing with said search argument an entry
stored in the node at which said search path ended; and if no at least
partial match between the search argument and said entry is found in said
current node, backtracking said search path by processing said first link
information of said current node; and repeating the previous two steps
until said at least partial match is found or said root node is reached.
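The descend-then-backtrack idea can be illustrated with a plain binary trie. This is a simplified sketch and not the method of U.S. Pat. No. 5,787,430: the parent pointer stands in for the "first link information", the child table stands in for the "second link information", and keys are given as strings of '0' and '1'. All names are hypothetical.

```python
from typing import Dict, Optional


class Node:
    def __init__(self, parent: Optional["Node"] = None) -> None:
        self.parent = parent                   # link used for backtracking
        self.children: Dict[str, "Node"] = {}  # links to child nodes
        self.entry: Optional[str] = None       # value stored for a prefix


def insert(root: Node, prefix: str, value: str) -> None:
    """Store `value` at the node reached by the bits of `prefix`."""
    node = root
    for b in prefix:
        if b not in node.children:
            node.children[b] = Node(node)
        node = node.children[b]
    node.entry = value


def longest_prefix_match(root: Node, key_bits: str) -> Optional[str]:
    """Descend as far as the key allows, then backtrack to the nearest entry."""
    node = root
    for b in key_bits:
        nxt = node.children.get(b)
        if nxt is None:
            break
        node = nxt
    current: Optional[Node] = node
    while current is not None:
        if current.entry is not None:
            return current.entry
        current = current.parent
    return None
```

Every node on the descent path corresponds to a prefix of the key, so the first entry found while backtracking is the longest matching one.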
[0293] FIG. 16 shows an embodiment of the main switching fabric device.
Preferably, each interface device chip has at least two integrated
parallel-to-serial ports which receive parallel data and convert the data
to a high speed serial data stream which is forwarded over a serial link
to the switching fabric device. Data received from switching fabric
device on a high speed serial link is converted to parallel data by
another DASL. An embodiment of the Serializer/Deserializer termed Data
Align Serial Link (DASL) is described herein.
[0294] At least one DASL interfaces the switching fabric device to the
serial links. Data from the serial link is converted into parallel data
which is delivered to switching fabric device. Likewise, parallel data
from switching fabric device is converted to serial data which is
delivered to the serial links. The serial links can be aggregated to
increase throughput.
[0295] Still referring to FIG. 16, the switching system includes switch
fabric 11, input switch adapters 13 (13-1 . . . 13-k) which are connected
to the switch fabric input ports 15 (15-1 . . . 15-k), and output switch
adapters 17 (17-1 . . . 17-p) which are connected to the switch fabric at
output ports 19 (19-1 . . . 19-p).
[0296] Incoming and outgoing transmission links 21 (21-1 . . . 21-q) and
23 (23-1 . . . 23-r) are connected to the switch system by line (link)
adapters 25 (25-1 . . . 25-q) and 27 (27-1 . . . 27-r), respectively. The
transmission links carry circuit switched or packet switched traffic from
and to attached units such as work stations, telephone sets or the like
(links designated WS), from and to local area networks (links
designated LAN), from or to Integrated Services Digital Network
facilities (links designated ISDN), or from and to any other
communication systems. Furthermore, processors may be attached directly
to switch adapters 13 and 17. The line adapters (LA) and switch adapters
(SA) have a common interface.
[0297] At the input switch adapters, various services from packet
switched and circuit switched interfaces are collected and converted into
uniform minipackets (having one of several possible fixed lengths), with
a header containing routing information designating the required output
port (and outgoing link) of the switch. Some details on the minipacket
format and on minipacket generation in the input switch adapters and on
depacketization in the output switch adapters will be given in the next
sections.
[0298] The switch fabric routes the minipackets via a fast self-routing
interconnection network from any input port to any output port. The
structure of the self-routing network is such that minipackets can be
routed simultaneously internally without any conflicts.
[0299] The heart of the switching system is the switch fabric. Two
different implementations are considered and will be described
separately. In one implementation, the switch fabric comprises a
self-routing binary tree for each input port, connecting the respective
input port to all output ports; the switch fabric comprises k such trees
in combination (if k input ports are provided). In the other
implementation, a bus structure with an output RAM is provided as a slice
for each output port, connecting all input ports to the respective output
port; the switch fabric comprises p such slices in combination (if p
output ports are provided).
[0300] In the drawings and specifications there has been set forth a
preferred embodiment of the invention and, although specific terms are
used, the description thus given uses terminology in a generic and
descriptive sense only and not for purposes of limitation.
[0301] DASL is described in application Ser. No. 09/330,968, filed Jun.
11, 1999 and incorporated herein by reference. The DASL Interface
receives data from a parallel interface such as a CMOS ASIC and partitions
the bits from the parallel interface into a smaller number of parallel
bit streams. The smaller number of parallel bit streams are then
converted into a high speed serial stream, which is transported via a
transmission medium to the receiver of the other module. A differential
driver with controlled impedance drives the serial bit stream of data into
the transmission medium.
[0302] DASL implements the method of parsing a data stream presented as N
bits in parallel into a plurality of portions each having n bits, wherein
n is a fraction of N; serializing each n bit portion of the data stream;
transferring each serialized portion over a corresponding one of a
plurality of parallel channels; and deserializing each transferred
portion of the data stream to restore the data stream to presentation as
N bits in parallel.
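The parse, serialize, transfer and deserialize steps can be sketched in software as follows. The sketch models each channel as a list of bits, sends the most significant bit first, and ignores clocking and alignment; these choices and all function names are assumptions made for illustration.

```python
from typing import List


def split_word(word: int, total_bits: int, portion_bits: int) -> List[int]:
    """Parse an N-bit word into N/n portions of n bits each, MSB portion first."""
    if total_bits % portion_bits:
        raise ValueError("portion width must divide the word width")
    count = total_bits // portion_bits
    mask = (1 << portion_bits) - 1
    return [(word >> (portion_bits * (count - 1 - i))) & mask
            for i in range(count)]


def serialize(portion: int, portion_bits: int) -> List[int]:
    """Turn one n-bit portion into a serial bit stream, MSB first."""
    return [(portion >> (portion_bits - 1 - i)) & 1 for i in range(portion_bits)]


def deserialize(bits: List[int]) -> int:
    """Rebuild an n-bit portion from its serial bit stream."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def transfer(word: int, total_bits: int, portion_bits: int) -> int:
    """Send a word over parallel serial channels and restore it as N bits."""
    channels = [serialize(p, portion_bits)
                for p in split_word(word, total_bits, portion_bits)]
    restored = 0
    for stream in channels:
        restored = (restored << portion_bits) | deserialize(stream)
    return restored
```

For example, a 16-bit word with 4-bit portions uses four channels, each carrying four bits per word.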
[0303] In the drawings and specifications there have been set forth
preferred embodiments of the inventions here disclosed and, although
specific terms are used, the description thus given uses terminology in a
generic and descriptive sense only and not for purposes of limitation.
* * * * *