A Practical Bytecode Interpreter for Programmable Routers on a Network Processor

S. Martin*, G. Leduc
Research Unit in Networking, Institut Montefiore, 4000 Liège, Belgium
* Corresponding author. Email address: martin@run.montefiore.ulg.ac.be (S. Martin)

Abstract

WASP is a programmable router platform that allows end-hosts to store ephemeral state in routers along the path of IP flows and to execute packet-attached bytecode that processes this data. We exploit lessons from past active network research and our knowledge of network processors to design a minimal interpreter that favours language restrictions over run-time checks. WASP provides safety with limited performance penalty through predictable execution time and bounded usage of memory and network resources. WASP is expressive enough to enable several applications including statistics collection and service discovery. It can also detect the common trunk of two Internet paths and exchange local measurements about these paths.
We propose a robust implementation on the IXP2400 network processor, and evaluate its performance through short benchmark programs against native functions hard-coded in the router. We achieve latencies below 7µs, i.e. less than the reference IPv4 forwarding latency, and throughputs approaching 800 kpps per core, which competes with, and sometimes even outperforms, native programs. We further exploit our results to give hints on further improving resource usage and guidelines on the management of ephemeral stores in high-speed networks.

Key words: bytecode, programmable router, performance analysis, network processor

1. Introduction

On today's Internet, multimedia applications have become a daily reality for the average user. Be it socially-shared videos, crawling through massive amounts of annotated pictures, video calls or online gaming where you can vocally tease your opponent, the fact is that the Internet is moving away from its mostly text-based structure and now focuses on other, delay-sensitive forms of media. In the meantime the progression of wireless access networks offers a growing variety of connectivity, and potentially multi-homing possibilities. Since most services on the Internet are already replicated at multiple locations, an application is likely to get even more potential paths to connect to a replica. Each path will have its own properties, and could be preferred or not for a given kind of transfer, which increases the need for sensing the state of the network.

The currently deployed architecture, however, provides little support for such investigations of the network's "health", and the simple task of figuring out the common trunk between a host and two of its partners involves several probing packets to unveil a part of the network topology. The current trend is to overcome those limitations of the network by more sophisticated interactions between end-systems, collaborating to acquire network statistics and sharing observations in a peer-to-peer fashion. While it is a perfectly valid approach in many respects, it certainly leads to an additional burden on the network as each application performs redundant active measurements to compensate for the lack of support from the network.

During the last decade the active network research activities have fundamentally revised the paradigms on which we build computer networks to offer more flexibility, such as the ability to perform application-specific forwarding decisions based on network status (congestion, queue levels, available bandwidth, etc.). In a typical active router, each packet carries or identifies a piece of code that defines either the complete forwarding procedure, or some custom code to be applied in pre-defined hooks of an otherwise static forwarding procedure. In WASP (World-friendly Active packets for ephemeral State Processing), we explore a way to extend router functionality in order to provide support for network measurements and property discovery. Unlike the vast majority of previously proposed active platforms, WASP does not attempt to provide a fully expressive programmable network, but rather restricts the programming model to achieve efficient and safe processing.

Given those restrictions, WASP offers a flexible service that can be used in conjunction with orthogonal services such as multicast or multi-path forwarding, rather than a catch-all solution. For instance, WASP cannot deflect traffic around a congested link, but it enables us to scalably publish the observation of congestion along a path, so that other flows considering a possible switch to that path would know in advance whether their performance could be improved.
Related Work

Most of the research activities to run active code on network processors so far have focused on building custom services by chaining pre-installed and operator-approved modules [1, 2, 3]. In [4], the authors present a prototype of the Ephemeral State Processing (ESP) router on IXP1200 network processors, where end-systems may only use a restricted number of operations that are performed on the router state store. Our work is clearly a derivative of this work, where we replace pre-defined operations by a bytecode interpreter.

Comparatively, an implementation of SNAP [5] on the IBM PowerNP [6] has explored the feasibility of a just-in-time compiler for SNAP bytecode. While the results speak in favour of compilation (rather than interpretation), this only holds for programs using loops or if the code generated for one packet can be reused for subsequent packets. We argue that, given the limited size of the instruction store in network processors (and especially in the Intel IXP family), compilation should be a way to optimise the most frequent packets rather than the default behaviour.

We also need to mention StreamCode [7], which also proposed programmability through a dedicated processor on FPGA suited to packet forwarding routines, using 'co-processors' for data copy and routing table lookup. A significant design choice in StreamCode is that the code in packets is in charge of allocating buffers and decides the next hop for each packet. This level of control over the router and network resources implies that StreamCode should not originate from the end-systems, but would rather be used by a smart border gateway to enforce specific QoS behaviours [8].

Document Structure

The paper is organised as follows: Section 2 presents the principles that have guided the design of the WASP platform. Section 3 discusses the platform itself, including the trade-off between expressiveness and safety of operations. We implemented a prototype of our WASP interpreter on the IXP2400 network processor and present a performance evaluation against the ESP filter prototype in section 4. We finally propose scalability improvements and guidelines for deployment on high-grade routers in section 5.

2. Design Principles

2.1. Ephemeral Store on a Router

The Ephemeral State Store (ESS) is a corner stone in the design of WASP, which has been inherited from the former ESP (Ephemeral State Processing [9]) filter. It is a (key, value) repository associated with a network interface where packets can drop and inspect state. Because all entries in the ESS have the same size (64-bit key and value) and are allocated for a fixed time period τ (namely 10 seconds), it is possible to engineer the state store so that it can always satisfy a request for a new slot, even at peak rate.

In [4] the authors illustrate this principle on an IXP1200 network processor. With at most 10^5 packets per second, each allowed to create 2 ESS entries that last for 10 seconds, a 46MB ephemeral store never gets full. Access to the ephemeral store can thus be made available to any end-system, without prior authentication or specific accounting, just like IP forwarding is offered to every machine connected to the network.

Similarly, the ESS offers only best-effort privacy, and it is up to the end-systems to agree on the 64-bit key(s) they will use so that no adversary could eavesdrop on them. Even without that/those key(s), one could still generate random keys in a brute-force way to try and corrupt existing state. However, since the state is kept only for 10 seconds, an attacker that manages to send 1.4 million WASP packets per second (i.e. saturating a 1Gbps link with his attack) would only alter existing state with probability lower than 2^-40.

Furthermore, the ESS can ease several monitoring tasks such as evaluating the jitter of a given packet flow. The local time t0 observed when packet P0 crosses the router is written in the ephemeral store. The next packet P1 can compare this with the observed local time t1 and build average, maximum and minimum values over a few packets. A special packet then collects those values in each router. Other statistics such as the depths of output buffers, transmission errors or access to the network packet header will of course enable more applications.
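As an illustration of this kind of use, the sketch below renders the jitter measurement above in C against a toy in-memory (key, value) store. The ess_get()/ess_put() helpers, the fixed keys and the time units are hypothetical stand-ins for the real ESS interface, not part of WASP or ESP.

```c
/* Minimal sketch (not the WASP instruction set): how a packet-carried program
 * could derive jitter statistics from the (key, value) interface described
 * above. ess_get()/ess_put() and the fixed keys are hypothetical stand-ins. */
#include <stdint.h>
#include <stdio.h>

#define ESS_SLOTS 64                       /* toy in-memory store                  */
static uint64_t ess_key[ESS_SLOTS];
static uint64_t ess_val[ESS_SLOTS];

static int ess_get(uint64_t key, uint64_t *val) {
    for (int i = 0; i < ESS_SLOTS; i++)
        if (ess_key[i] == key) { *val = ess_val[i]; return 1; }
    return 0;                              /* entry absent (never created/expired) */
}
static void ess_put(uint64_t key, uint64_t val) {
    for (int i = 0; i < ESS_SLOTS; i++)
        if (ess_key[i] == key || ess_key[i] == 0) {
            ess_key[i] = key; ess_val[i] = val; return;
        }
}

/* Executed "per packet": t_now plays the role of the router-local timestamp
 * exposed to the bytecode; k_last/k_max are application-chosen 64-bit keys.   */
static void jitter_probe(uint64_t t_now, uint64_t k_last, uint64_t k_max) {
    uint64_t t_prev, worst = 0;
    if (ess_get(k_last, &t_prev)) {        /* a previous packet left its time      */
        uint64_t delta = t_now - t_prev;
        ess_get(k_max, &worst);
        if (delta > worst) ess_put(k_max, delta);   /* keep the worst gap so far   */
    }
    ess_put(k_last, t_now);                /* leave our own timestamp behind       */
}

int main(void) {
    uint64_t times[] = { 100, 210, 290, 430 };   /* packets P0..P3 crossing the router */
    for (int i = 0; i < 4; i++) jitter_probe(times[i], 0xA1, 0xA2);
    uint64_t worst = 0; ess_get(0xA2, &worst);
    printf("worst inter-arrival gap: %llu\n", (unsigned long long)worst);  /* 140 */
    return 0;
}
```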
Proper operation of virtually any ESS-based protocol relies on the assumption that the network can be trusted to deliver packets only to their intended recipients. In some access or local networks this may require link-level encryption between hosts and access routers to avoid blatant information harvesting and impersonation.

2.2. World-Friendliness

Our aim with WASP is to find the right balance between expressiveness, efficiency and safety in the processing of application-specific code in the network. We use the term "world-friendly" to refer to this triple objective of (1) user-friendliness: WASP should provide enough flexibility and a clear model of the offered service, where the end-system keeps control of the service its packets should receive; (2) router-friendliness: WASP makes sure that router resources are correctly used, and that (possibly malicious) bytecode has predictable execution time and cannot exhaust router memory; and (3) network-friendliness: WASP sticks to the network transparency model of IP, disabling any "surprising" behaviour such as cloning of packets or altering source/destination fields. The transport protocol and application data are also out of reach, and the only part of the packet WASP can alter is its own "scratchpad" to record state observed in the network.

2.3. The Case for Cooperation

Even if we perfectly fulfil the world-friendliness objectives, any addition of functionality to the network is only of interest for a network operator if it can bring some added value to his business. WASP mainly allows one to trade processing power and temporary storage against bandwidth, in the same way a transparent Web proxy does, which – depending on the operator – may reduce operational expenses or not.

Unlike transparent proxies however, WASP requires cooperation from end-systems, where applications will attach WASP programs and extract results, but it offers a richer set of interactions. It is possible for instance to efficiently locate third-party services along a path, control one's flow rate or locate peer systems that own a replica of a wanted piece of data.

Locating available members of a peer-to-peer application (e.g. grid computing, file sharing, maintaining virtual coordinates) in a decentralised fashion, for instance, typically requires scanning large parts of the Internet (e.g. using randomly generated addresses [10]) before a contact node can be found. We have shown previously [11] that having 10% of a community connected through routers offering ephemeral storage is enough to find a contact node with only a small number of probes. Moreover, peers that are attached through a WASP router will almost immediately discover peers running in the same ISP, potentially reducing long-distance traffic.

We also previously illustrated the use of WASP and the ephemeral store to allow an operator deploying an application-specific service (such as a proxy/depot, a media transcoder or a system for merging results of grid computations) to make the service visible to end-systems establishing a connection through the network [12].

We believe that in the context of an increasingly flattening Internet [13], where major content providers are pushing their WAN closer to the connectivity providers, a generic mechanism for enabling such interactions could be both technically viable and economically sound, especially as it allows one to decouple hardware upgrades from application-level innovations and improvements. It may also be useful for an operator that has a special interest e.g. in grid computing and wants to differentiate his offer with additional services, but has no direct way to remotely configure customers' software (e.g. a national research & education network).

2.4. IXP Network Processors

Network processors (NPUs) are programmable chips designed to replace the dedicated circuits that used to equip line-cards in routing equipment. The term covers a large diversity of designs, but on most of them we can highlight the presence of a control core (typically an embedded RISC processor), and a series of data-processing cores to perform custom functions on every packet. The chip usually also features co-processors that assist the other units in specific tasks such as checksums, data transfers, lookups, encryption, etc.

Figure 1: IXP2400 network processor structure diagram, annotated with timings reported by [14].

In this paper we will focus on the Intel IXP2400 depicted on Fig. 1, which features eight micro-programmed cores (the micro-engines or MEs) for fast-path functions and a StrongArm for control and management. We can also note the presence of memory units of different technologies, each with its own access characteristics (such as commodity DRAM with high access latency but efficient burst transfers) and size limits. Programs on the micro-engines are directly exposed to those hardware details, and optimising the placement of data structures on the right kind of memory is usually required to achieve good performance.

A determining aspect of NPU programming is the pursuit of wire speed processing. In other words, we want to ensure that our network device is capable of fully utilising the output links when it receives enough traffic on its input links, even when all packets have minimal size [1]. If a router fails to meet this requirement, an attacker can easily deny routing to other flows with a traffic volume that the router should handle without problems. In order to keep the output link busy even when the processing time exceeds the transmission time of a minimal packet, NPUs typically feature massively parallel architectures that allow for pipelining of sub-tasks and parallel execution on different packets. On the IXP2xxx, each micro-engine has 8 independent hardware contexts and an arbiter that allows another context to be executed on the ME while the previous context is waiting for a memory or I/O access to complete.
3. The WASP Platform

The WASP router combines an unmodified forwarding core, surrounded by WASP filters, each associated with a network interface. A packet crossing the router is thus first processed by the WASP filter associated with the receiving interface, then handled by the forwarding core (which will e.g. look up the forwarding table and dispatch the packet to the proper output card), and finally processed by the WASP filter associated with the output interface, as shown on Fig. 2.

Figure 2: A WASP router and WASP execution environment. Gray items mean the VPU has read-only access to the resource.

This overall design is inherited from the Ephemeral State Processing (ESP) router [9] and has strongly helped in reaching network-friendliness. As the core of the router still operates like a regular router forwarding engine, the only notable difference, as far as forwarding is concerned, is that filters introduce the possibility for a packet to be dropped pro-actively (just as if a queue was full) or be returned to its source in anticipation (just as if the TTL limit was reached).

Each WASP filter consists of a dedicated ephemeral state store (ESS) and a virtual processor (VPU) executing the WASP programs embedded in packets passing through the filter. During its execution, a WASP program only has access to its own scratchpad S_P (packet-carried variables) and a set of ESS entries (E_1, E_2, ...) identified by the keys carried with the packet. The sole effect of the WASP (or ESP) filter is to transform these entries into S′_P, E′_1, E′_2, ... and to decide on the packet's fate. All other state on the router (e.g. FIB, MIB, queues) and in the packet (forwarding headers and payload) remains unchanged.

WASP departs from the ESP design by expressing the function that performs this transformation in the packet itself, via a bytecode program that will be interpreted on a virtual processor associated with the ESS. Comparatively, in the original ESP, packets just contained operands and an opcode selecting one of the pre-compiled transformation functions – some of which require over 20 lines of pseudo-code for their description.

We also decided to drop ESP's "central" store, which could theoretically process every packet crossing the router, to ensure a scalable architecture where extra functionalities are solely provided on line cards. We compensate this limitation through "admin packets" that may originate only from management stations of a given network domain and that are given the ability to commit a write in all the state stores of a node.

Compared to most interpreters featured in active routers, the VPU found in WASP is extremely simple and lightweight. We refined the memory access opcodes and the ALU workflow to allow both compact encoding of programs and efficient execution (mainly through optimising for sequential access to packet variables), while keeping the interpreter compact enough to fit in a single microengine of an IXP processor. Unlike most virtual machines, the VPU has no heap to manage, only performs one-cycle arithmetic operations on integers (e.g. no floats, no divisions) and its whole state is reset every time a new packet is processed.
3.1. Improving Utility

The original ESP router allowed interesting interactions via the ephemeral state store, mainly illustrated in the context of reliable multicast transmission [15]. Yet, the programming model remained tedious to use and master. For a given problem where the designer has the feeling the Ephemeral State Store could be helpful, it is at best unclear how to combine the available functions (in a sequence of packets) to achieve that goal. In many other situations we just experience the frustration that the operations defined are too specialised to be of any use.

In contrast, a bytecode-based encoding of the functions allows the application designer to express the exact operation to be performed by each WASP packet. We also provided new "micro-operations" through the bytecode interpreter that were not found in any previous ESP function, such as the ability for a packet to return to its source.

To further extend the utility of WASP compared to ESP, we allow the bytecode to access per-interface and basic router setup information (depicted on Fig. 2 as the "environment variables" memory bank) and the network-layer packet header. Such variables will indicate e.g. the IP address of the node, a node-local timestamp, or a rough state of the output queues of an interface (to allow packets to "sense" congestion). They could also indicate whether a given link is wireless, its error rate and peak bandwidth. We tried to keep such information minimal, as network operators are typically reluctant to disclose any information about the internals of their network. As a rule of thumb, we suggest that information is eligible for environment variables only if it is already possible for end-systems to infer that information through active measurements.

Finally, we extended the semantics of the ESS itself. While entries in ESP always have public visibility, WASP also supports protected entries that can be modified only by the network operators, and optional protocol-private entries that are accessible only to packets using a specific program. As we have shown previously [11, 12], this enables applications operating on more sensitive data, such as locations where traffic could be redirected.

3.2. Building for Safety

The challenge of safety in active networks is essentially twofold: execute third-party code without putting the router at risk (e.g. ensure there will be CPU and memory to sustain submitted packets) and allow applications to control their flow without raising threats on the network (e.g. prevent those programs from flooding hosts or links). Though we exploited the results and guidelines of previous active network research (e.g. [16]), most of the safety issues were fortunately simplified by the VPU design, removing the need for run-time checks.

Network resource usage is at worst similar to the regular use in IP networks, thanks to the filter-based organisation: either the filter drops the packet, or it lets the legacy core handle it. If we ensure that return operations can happen only once for each packet, WASP trivially never introduces loops in the network, nor does it flood (or help flood) hosts and links. We should stress however that if we attempt to extend the functionalities of WASP, we need to ensure that we do not break this property. While we initially planned to allow WASP programs to alter the destination of their packets in a restricted way [17], it turned out that this could not be implemented safely unless the forwarding infrastructure guarantees that packets originate from their respective source address.

Similarly, the simplicity of the VPU and its minimal set of opcodes allow us to guarantee that the processing time of any packet P is O(|P|), where |P| is the length of the WASP program attached to P in bytes. This is possible thanks to the fact that we forbid backward jumps as in SNAP [5], allowing only if-else blocks, but no (wild) loops. By also forbidding complex arithmetic instructions, we ensure that all microbytes (with the notable exception of ESS access instructions) can be processed in a similar time, which prevents attacks using an abnormally high amount of "heavy" opcodes (such as div). If we enforce a restriction on the number of ESS references a program can issue, the code length (not its content) is thus sufficient to predict execution time.
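The argument for the O(|P|) bound can be made concrete with a small dispatch loop: if jumps may only move the program counter forward, the loop below cannot iterate more than |P| times. The opcode names and encoding are invented for illustration; they are not the actual WASP microbyte set.

```c
/* Why forward-only jumps give O(|P|) execution time: pc is strictly increasing,
 * so the loop runs at most len iterations. Invented opcodes, for illustration. */
#include <stdint.h>
#include <stddef.h>

enum { OP_NOP, OP_ADD, OP_SKIP_FWD, OP_RETURN, OP_END };

static int run(const uint8_t *prog, size_t len, uint32_t *acc)
{
    size_t pc = 0;
    while (pc < len) {                          /* at most len iterations: O(|P|) */
        uint8_t op = prog[pc++];
        switch (op) {
        case OP_NOP:
            break;
        case OP_ADD:                            /* one-cycle integer ALU ops only */
            if (pc >= len) return -1;
            *acc += prog[pc++];
            break;
        case OP_SKIP_FWD:                       /* jumps may only move forward,   */
            if (pc >= len) return -1;           /* so pc never decreases          */
            pc += (size_t)prog[pc] + 1;
            break;
        case OP_RETURN:                         /* "return to source": happens at */
            return 1;                           /* most once, execution stops     */
        case OP_END:
            return 0;
        default:
            return -1;                          /* malformed program              */
        }
    }
    return 0;
}
```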
The most challenging checks to ensure safety of an active router come from memory protection and the need to avoid router state corruption or information disclosure. This is a specific issue that should not be confused with access control to application-maintained state. In most active networks, the virtual machine processing application-specific code has access to sensitive router state such as forwarding tables or packet queues. This access can either be direct (e.g. the data sits in the VM's address space) or indirect (the VM relays objects between trusted address spaces), and designers mostly rely on strongly typed languages to avoid abuses through wild pointers.

In contrast, the VPU has an extremely small address space (256 bytes), which only contains data that WASP bytecode is allowed to access. Packet-bound variables are not allowed to grow beyond the room initially allocated by the sender, and node-resident data are only accessed through the key/value interface of the ESS. The absence of any type-checking in the VPU is possible because the VPU ignores any type except integers, and because we leave the organisation of ESS entries and scratchpad entirely up to the application. Unlike scripting languages designed to be used as glue between sensitive services, we do not need more typing in the case of WASP.

Rather than issuing references, WASP programs look up randomly generated 64-bit keys, and access control is defined on a per-entry basis: either the whole entry is publicly available (and anyone knowing its key can alter any bit of it) or it is an operator-installed, read-only entry. This gives us the extra advantage that packet-bound variables can be accessed directly, without the need for any unmarshalling operation – which may account for e.g. up to 42% of the processing time for simple packets in Java-based active networks [18].

3.3. Example of Use

We insist on the fact that the WASP programming model allows the network to mix WASP-capable and legacy routers. Applications using WASP will thus have to take this into consideration and degrade gracefully when fewer "cooperating" routers are available. This could mean less information available to the end-systems, sub-optimal placement of helper proxies, etc.

Scanning path. The most obvious WASP application is a traceroute-like packet that records information about the routers it crosses. However, while a regular traceroute program requires at least one packet per hop, WASP can easily store 30 hops in a single packet. It is also possible to program the WASP bytecode so that only a portion of the path is recorded, or so that the packet returns after collecting n addresses.

Common trunk. Given three nodes S, A and B, a very simple use of the ephemeral state allows S to tag routers on path S−A with one packet while a second packet will travel along path B−S and record the addresses of tagged routers (this mechanism was initially described in [15], though none of the ESP operations provide support for it). This can identify the common part S−R of the two paths where traffic towards A and B may interfere, or directly return the address of R (the last common router) where deployment of a duplication/merging function could be interesting. A sketch of this tag-and-check pattern is given at the end of this section.

Rate Control. Given that routers export a hint on the depth of output queues, we can program packets that drop themselves if the router appears congested, making room for "more important" data in the same flow (e.g. dropping B-frames in favour of I-frames in an MPEG flow [17]). WASP also offers the flexibility to detect losses between any successive WASP routers and to prepare state so that such losses are reported back to the source by e.g. acknowledgements. The positive impact of such decisions on the quality of a video stream has been demonstrated in many former works on active networks [19].

Service Advertisement. WASP packets coming from management stations in an autonomous system drop the IP address of a service provider using a well-known protected key (e.g. a hash of the service name), so that a modified version of the path-scanning program can report to the end-system where the wanted service can be found [12]. At each router R where the advertisement is stored, the TTL of the advertisement packet can be used to keep only the closest provider from R.

Most of these programs can of course be extended or modified to better match application needs. Scanning, for instance, is not limited to IP addresses of routers, but can be extended to any piece of information available in the ESS or in the environment variables memory bank.

More sophisticated versions of the common trunk identification could use a bitmask in the ESS in order to quickly get a snapshot of a full sink tree. Another possible extension is to make use of a shared secret key so that members of a community can mark paths on the Internet that are under measurement (e.g. bandwidth-wise in a peer-to-peer distribution network) to avoid oscillations due to simultaneous probing of a path by two pairs of nodes.
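The common trunk pattern above can be summarised by the following C sketch, where the per-router ephemeral entry is reduced to a single tagged value and the two paths to lists of router indices. These data structures are toy stand-ins, not the WASP encoding of the programs.

```c
/* Tag-and-check sketch of the common trunk example: packet 1 tags routers on
 * S-A under an application-chosen key, packet 2 walks B-S and records which
 * routers already carry the tag. Per-router storage is a simplified stand-in. */
#include <stdint.h>
#include <stdio.h>

#define N_ROUTERS 8
static uint64_t ess_tag[N_ROUTERS];          /* one ephemeral entry per router  */

/* packet 1: walk S-A and drop the tag key at every router crossed             */
static void tag_path(const int *path, int hops, uint64_t key) {
    for (int i = 0; i < hops; i++)
        ess_tag[path[i]] = key;
}

/* packet 2: walk B-S and record in its scratchpad the routers already tagged  */
static int check_path(const int *path, int hops, uint64_t key,
                      int *scratchpad, int max) {
    int n = 0;
    for (int i = 0; i < hops && n < max; i++)
        if (ess_tag[path[i]] == key)
            scratchpad[n++] = path[i];
    return n;
}

int main(void) {
    const int path_sa[] = { 1, 2, 3, 4 };    /* S - R1 - R2 - R3 - R4 - A       */
    const int path_bs[] = { 6, 5, 3, 2, 1 }; /* B - R6 - R5 - R3 - R2 - R1 - S  */
    int found[N_ROUTERS];
    uint64_t key = 0xC0FFEEULL;              /* randomly generated in practice  */

    tag_path(path_sa, 4, key);
    int n = check_path(path_bs, 5, key, found, N_ROUTERS);
    for (int i = 0; i < n; i++)
        printf("common router: R%d\n", found[i]);   /* prints R3, R2, R1        */
    return 0;
}
```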
3.4. WASP on Network Processor

An application on the IXP is typically split up into several components that will run on the different processing elements. Pipeline processing is typically assisted by hardware scratch rings programmed to relay packet handles and metadata between MEs. Note that the packet content is transmitted directly from the I/O buffers into DRAM (by the RX microblock) and only meaningful parts will be fetched on demand, e.g. by the classifier microblock CLS.

The setup of our proof-of-concept implementation is depicted on Fig. 3. We limited ourselves to layer 2 forwarding (plus WASP processing) and the XScale's role is limited to microcode loading, monitoring and debugging. We kept ESP and WASP apart, on separate MEs, to ease individual testing of each implementation. Each of the 4 MEs doing WASP or ESP processing is connected to the classifier with 16 independent queues and operates its own ESS.

Figure 3: Internal structure of our WASP/ESP packet filter. 4 MEs do the WASP/ESP processing and we have one "spare" ME.

Figure 4: Layout of the Ephemeral Store, showing 3 keys (K1, K2 and K3) hashed to the same value k and chained in the entry table, with their respective creation time (ctime).

Out of the 256 MB of DRAM available on our Radisys ENP-2611 card, 192 are under the control of the Linux kernel running on the XScale, 32 are used as packet buffers by the microengines, and the remaining 32 MB host the 4 ephemeral stores. As depicted on Fig. 4, one ESS is made of a hash table in SRAM and an entry table hosted in DRAM, where each entry consists of a 64-bit key, its associated 64-bit value and metadata, for a total of 24 bytes. Every slot k in the hash table points towards the oldest entry in the store where hash(K1) = k. Other colliding entries (e.g. K2 and K3 on Fig. 4) that hash to the same value k are simply chained in their creation order. Thanks to this organisation, periodic cleanup of the ESS is greatly simplified: we just sequentially scan entries in DRAM starting from the last_cleared mark until we hit an entry e with e.ctime + τ > now. When e.g. K1 expires, we replace the pointer in SRAM slot k by a pointer to K2 (e.next) and we advance last_cleared.
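The store layout and the cleanup scan just described can be sketched as follows. The entry fields (key, value, ctime, next), the last_cleared mark and the 10-second lifetime come from the text and Fig. 4, while the table sizes, the hash mix and the circular allocation convention are illustrative assumptions.

```c
/* Sketch of the Fig. 4 layout and of the cleanup scan described above; sizes,
 * hash function and the circular entry allocation are assumptions.           */
#include <stdint.h>

#define TAU_TICKS   10u          /* entries live for 10 seconds                  */
#define N_SLOTS     1024u        /* SRAM hash slots (toy value)                  */
#define N_ENTRIES   4096u        /* DRAM entry table (toy value)                 */

struct ess_entry {               /* 24 bytes on the IXP: key + value + metadata  */
    uint64_t key;
    uint64_t value;
    uint32_t ctime;              /* creation time                                */
    uint32_t next;               /* next colliding entry, in creation order      */
};

static uint32_t         slot[N_SLOTS];       /* SRAM: oldest entry of each chain */
static struct ess_entry entry[N_ENTRIES];    /* DRAM: allocated sequentially     */
static uint32_t         last_cleared, next_free;

static uint32_t hash_slot(uint64_t key) {
    return (uint32_t)((key * 0x9E3779B97F4A7C15ull) >> 54) & (N_SLOTS - 1);
}

/* Periodic cleanup: scan sequentially from last_cleared until an entry is still
 * alive. Because entries are chained in creation order, the expired entry is
 * always the current head of its chain, so unhooking it is a single write.    */
static void ess_cleanup(uint32_t now)
{
    while (last_cleared != next_free) {
        struct ess_entry *e = &entry[last_cleared];
        if (e->ctime + TAU_TICKS > now)
            break;                               /* everything beyond is younger */
        slot[hash_slot(e->key)] = e->next;       /* e was the oldest of its chain */
        e->key = 0;                              /* mark the entry as free        */
        last_cleared = (last_cleared + 1) % N_ENTRIES;
    }
}
```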
The initial implementation of the ESS on the IXP was borrowed from [4] and was accessed through two opcodes (lookup and insert) that mimic the statements val = get(key) and put(val, key) found in the pseudo-code of ESP functions. In previous work on the x86 architecture [17], we explored various ways to optimise ESS accesses to compete with the performance of ESP. A first approach, later referred to as "/cache", simply adds a one-entry cache storing the last key looked up and the address of the corresponding entry in DRAM. The effect is that we immediately know where to write back a value with the insert microbyte. This saves one hash table lookup in SRAM and chain walking in DRAM. We observed that maintaining a larger cache is not worth the effort for WASP programs.

Most of our operations will not manipulate just a single 64-bit value, but rather operate on tuples in the ESS that need to be manipulated atomically. We allowed WASP to use two consecutive entries of the ESS to offer a 32-byte memory bank where the elements of a tuple can be grouped together under a single key. The map microbyte allows such an entry to be accessed as a second bank by the VPU. WASP programs rewritten to take advantage of this alternate access mode are referred to as "/map" in the following section.

4. Performance on IXP2400

For our performance analysis of WASP on the IXP platform, we mostly focused on two "reference" functions that were already available on the ESP implementation. This allows us to evaluate the overhead introduced by bytecode interpretation against a pre-compiled version of the same function. The count function only needs one variable in the ESS that is used as a counter. Each packet increments the counter and drops itself if the counter is above a given threshold. The collect function is used to aggregate samples from n sources, using one ESS variable to store the temporary aggregate for the k sources that have already been processed, and a second variable to count the remaining n − k sources that should still give their sample. A collect packet is forwarded only when the n samples have been aggregated, and then continues towards its destination with the aggregated sample.

The pseudo-code for the count and collect functions, as well as their translation into WASP bytecode and an example of how they can be used on a merging tree, is detailed in [20] (a lighter version can also be found at http://www.run.montefiore.ulg.ac.be/~martin/resources/wasp-prm.pdf). Through the rest of this document, the W:* prefix will be used to refer to a WASP packet (e.g. W:count) while the E:* prefix refers to ESP packets.
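For reference, the behaviour of these two functions (whose actual pseudo-code and bytecode are given in [20]) can be rendered in C as follows; the ESS is reduced to variables passed by reference and the packet's fate to a return value.

```c
/* Hedged C rendering of the two reference functions described above; not the
 * WASP bytecode itself.                                                       */
#include <stdint.h>
#include <stdio.h>

enum verdict { DROP, FORWARD };

/* count: one ESS variable used as a counter; the packet drops itself once the
 * counter exceeds the threshold it carries.                                   */
static enum verdict wasp_count(uint64_t *counter, uint64_t threshold)
{
    *counter += 1;
    return (*counter > threshold) ? DROP : FORWARD;
}

/* collect: aggregate samples from n sources. One ESS variable holds the running
 * aggregate, another the number of samples still expected; the packet is
 * forwarded (carrying the aggregate) only when the last sample arrives.       */
static enum verdict wasp_collect(uint64_t *aggregate, uint64_t *remaining,
                                 uint64_t n, uint64_t sample,
                                 uint64_t *packet_value)
{
    if (*remaining == 0)             /* first packet of this computation        */
        *remaining = n;
    *aggregate += sample;            /* a sum here; could be min/max as well    */
    *remaining -= 1;
    if (*remaining > 0)
        return DROP;                 /* wait for the missing samples            */
    *packet_value = *aggregate;      /* last packet carries the result onwards  */
    return FORWARD;
}

int main(void) {
    uint64_t counter = 0, agg = 0, rem = 0, out = 0;
    for (int i = 0; i < 5; i++)      /* threshold 3: forwards 3 packets, drops 2 */
        printf("count -> %s\n", wasp_count(&counter, 3) == DROP ? "drop" : "forward");
    for (int i = 0; i < 3; i++)      /* three samples 10, 20, 30                 */
        if (wasp_collect(&agg, &rem, 3, 10 * (uint64_t)(i + 1), &out) == FORWARD)
            printf("collect result: %llu\n", (unsigned long long)out);   /* 60   */
    return 0;
}
```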
4.1. Latency Measurements

All latencies in this section have been measured directly on the IXP, through instrumentation of the "receive" (RX) and "transmit" (TX) microblocks provided by Intel as part of the IXA SDK, after proper synchronisation of the ME timestamp counters. The 1Gbps interface cards of the PCs in our lab indeed proved to introduce load-dependent delays (likely due to interrupt moderation mechanisms on the Broadcom NetXtreme BCM5701 and the on-board Intel 82541GI/PI Gigabit Ethernet controller) that interfere with all trace-based measurements – not mentioning NTP offsets seen as high as 10µs. Taking advantage of the internal structure of the RX and TX code, we further separated latency measurements depending on whether the packet size was below or above 128 bytes, which is the size of an m-packet: the atomic data unit between the switch fabric interface and the processing unit of the IXP. This later allowed us to isolate measurements for a specific class of packets by artificially "inflating" all other packets we submitted to the IXP with padding bytes.

Figure 5: Cumulative distribution of packet forwarding latency at low throughput for the count instruction.

In the conditions described above, our WASP filter forwards W:count packets with an average latency of 6.34µs. Dumb packets of identical size have an average latency of 2.45µs, and W:count packets that are not processed (i.e. going through the classifier tests, but without bytecode interpretation) take on average 2.91µs. The same experiment repeated with E:count packets showed an average latency of 5.47µs. Comparatively, the IPv4 forwarder demonstration application [21] takes on average 6.92µs to forward a 64-byte packet under 25% throughput. The results for collect (Fig. 6), on the other hand, are less impressive: 136% of the native version. With 8.58µs for WASP against 6.29µs for ESP, we clearly see the impact of a longer WASP program here.

Figure 6: Cumulative distribution of packet forwarding latency at low throughput for the collect instruction.

We repeated the experiment with the one-entry cache enabled, which only led to slight improvements of 0.37µs and 0.78µs respectively (see the WASP /cache series on Fig. 5). However, as reported through the WASP /map series on Fig. 6, making use of map strongly improved the latency of W:collect packets, which is now almost equal (101%) to its native ESP counterpart. Another good point is that /map also achieves good performance on count (106% of the E:count latency) although count was already using only one key. It even performs better than /cache although it now transfers twice as much memory, thanks to DRAM burst transfers.

We can also observe on Fig. 5 that the latency distribution is mostly split into two (or sometimes three) steps and that 95% of the samples are located no further than 52ns from the closest step centre – which corresponds to the variance in latency of DRAM accesses. Those steps are intriguingly spaced by 516ns on average, with very small deviation, independently of the protocol or function considered. By deeper inspection of the chains stored in the ESS, we can state that this is not an artifact of some hash collision, nor an extra delay incurred by the creation or cleaning of entries.

Actually, if we look at the latencies until packets are queued to the TX microblock, we get a smooth, Gaussian-like distribution. Together with the fact that the delay of 516ns almost exactly corresponds to the time required to transmit a minimal-sized packet on the 1Gbps medium, it sounds reasonable to consider that the "stepping" is introduced by the transmit hardware depending on whether it is found idle or busy when our transmit request is submitted.

4.2. Throughput Measurement

4.2.1. Methodology

The maximum capacity of our traffic generator (on a dual-core Xeon, 3GHz) was 907 Mbps with all packets having the maximum size, and 170 kpps (98 Mbps) when using only packets of minimum size, which is way below the theoretical maximum throughput of a single port-pair of Gigabit Ethernet (1.488 Mpps). In order to keep our tests independent of additional equipment (e.g. aggregating hubs) and to minimise the amount of code to be modified on the network processor, we opted for a half-software, half-hardware "traffic accelerator" testbed.

Figure 7: Packet flow in our testbed, highlighting the full-duplex use of the loop-back link.

As depicted on Fig. 7, we connected port 2 of the ENP-2611 board to a single test machine and wired the two remaining ports together. We have then rewritten the classifier's rules to enforce the following policy:

1. regular packets coming on port 2 are returned to port 2;
2. WASP/ESP packets coming on port 2 are delivered either to port 0 or 1 depending on a bit of the Computation ID (CID) field;
3. packets received on port 0 are delivered to port 1 and vice-versa.

The result is the creation of a two-way loop involving one WASP/ESP processing and one transmitting delay that keeps packets until they drop themselves. We can "load" this loop by sending incremental traffic from the test machine and watch the impact on the system. We of course adjusted the ESP functions and WASP microbytes to allow unlimited thresholds.

Rule #1 ensures that the loop remains noise-free during our measurements. We have indeed observed that the test machine automatically exchanges a few packets every time we reload the test software, which may quickly lead to an extra 200 kpps stress on the loop. Temporarily disabling rule #1, we can measure a maximum throughput of 2,972 kpps in the loop for "regular" packets, which is 99.86% of the theoretical maximum throughput of a full-duplex Gigabit Ethernet link. Any throughput limitation that we will measure with ESP and WASP packets will then be attributed either to the WASP/ESP microblock, or to the WASP/ESP-specific part of the classifier.

We also observed that packets that simply start with a FWD microbyte can be processed at a maximum throughput of 2,966 kpps by a single ME. The same program padded to 16 microbytes and carrying one bank of unused data (thus with a fetch/checksum cost similar to that of a W:count) will grow to 100 bytes on the wire and will limit the throughput to 2,046 kpps – 98.22% of the theoretical maximum for packets of that size.

4.2.2. Count Performance with 1 Entry per Chain

Using the "transmitted packets" counter of the TX microblock, we estimated the throughput of flows of count packets using both WASP and ESP while varying the amount of processing resources activated. Since the classifier uses the CID field to select the queue a packet should take to reach the WASP microblock, and since only one hardware thread can operate on one queue at a time, we can indeed decide how many threads (and MEs) can be active at a time by simply restricting the values allowed for CID in the traffic generator. The best performance of ESP and WASP on a single ME corresponds to 58 and 38% of the maximal throughput, respectively.

Table 1: Throughput (kpps) of the WASP and ESP microblocks, processing count packets with a single entry per hash chain and a varying number of active queues and hardware contexts.

# queues     | 1   | 2   | 4    | 6    | 8    | 16
WASP (1 ME)  | 315 | 528 | 653  | 746  | 764  | 788
ESP (1 ME)   | 390 | 672 | 1096 | 1258 | 1430 | 1546
WASP (2 ME)  | 315 | 632 | 1026 | 1190 | 1293 | 1552
ESP (2 ME)   | 390 | 779 | 1298 | 1603 | 1744 | 1744

By translating the throughput measurement T(n) (where n is the number of active threads) into inter-packet delays D(n) = n/T(n), we can observe that each new active thread on the ME increases that delay by δ (almost constant and in the range 800-900ns), making D(n) ≈ D(1) + (n − 1)δ almost linear with n in the case of WASP. We cannot apply a similar model to ESP, as it uses a special read-and-modify bus cycle that reduces the available DRAM bandwidth as the packet rate increases.
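A direct consequence of this linear delay model is that per-thread gains shrink and single-ME throughput saturates near 1/δ. Taking δ in the quoted 800-900ns range (an estimate, not a measured constant) gives the following rough ceiling, which is consistent with WASP benefiting more from a second ME than from extra threads on the same ME:

\[
T(n) = \frac{n}{D(n)} \approx \frac{n}{D(1) + (n-1)\,\delta}
\xrightarrow[\;n \to \infty\;]{} \frac{1}{\delta}
\approx \frac{1}{0.8\text{--}0.9\,\mu\mathrm{s}} \approx 1.1\text{--}1.25\ \mathrm{Mpps}.
\]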
While there are only 8 hardware threads on a ME, we can see on Table 1 that all 16 queues have been required to achieve the highest throughput. This can be charged to the time required to probe 8 (empty) queues, and has been avoided in further tests by properly balancing the k active CIDs among the 16 available queues (rather than using queues 1..k).

We then reproduced the experiment with 1 to 8 threads balanced on two different MEs to estimate how increased CPU power improves the performance. A "4 threads" setup thus means that we will have 2 active queues served on each ME. Comparing rows 1 and 3 in Table 1 confirms that the WASP interpreter is CPU-bound. Indeed, while balancing the load on two different MEs leads to a throughput increased by 20% with ESP, WASP sees its throughput improved by 57% (4 queues) to 70% (8 queues), and the maximum throughput when all 16 queues are used has almost doubled.

4.2.3. Increasing Hash Chain Length

In order to estimate the performance of a saturated store, we extracted chains of colliding keys observed in the store and generated for each individual "computation" a series of k packets that will reference one of the k keys that belong to the same chain, therefore experiencing chain traversals of length from 1 to k.

When processing on a single ME, we can note that the gap between ESP and WASP performance is reduced (from 50 to 40%) as the average chain length increases. We can also observe that the relative throughput reduction is less important in the case of WASP (the worst observed throughput is 87% of the best one with WASP, against 74% in the case of ESP). Table 2 gives the observed throughputs for both E:count and the two flavours of W:count already discussed in section 4.1. Surprisingly, when we fully load the two MEs, the WASP packet using lookup outperforms the native implementation offered by ESP, and this regardless of the ESS state. The final explanation has been found in the code: the ESP microblock – in its current state – will re-generate the CRC checksum and update the packet header and operands in DRAM regardless of the computation performed. The WASP VPU comparatively keeps track of data and state "dirtiness" and will only issue a DRAM update when the content of the packet has been modified – which never happens in the case of count.

Table 2: Throughput (kpps) with 16 queues, depending on the average hash chain length, with one (left) or two (right) microengines.

length:#ME | 2:1  | 6:1  | 10:1 | 2:2  | 6:2  | 10:2
ESP        | 1502 | 1296 | 1124 | 1733 | 1470 | 1314
W /cache   | 946  | 902  | 826  | 1858 | 1686 | 1476
W /map     | 774  | 730  | 680  | 1526 | 1396 | 1225

4.3. Throughput of the Collect Function

We repeated the experiment with collect packets using only the k-th entry in chains. Such chains are built with count packets that drop themselves immediately after execution. A vertical bar in Fig. 8 reports the throughput of one scenario (i.e., either ESP or WASP, and chain length) for an increasing amount of processing power. We can observe here again how ESP takes advantage of additional threads and how WASP rather benefits from an additional ME, even with the same number of threads.

Figure 8: Measured throughput (kpps) for the collect operation with chain length varying from 2 to 10 entries, for both ESP and WASP (e.g. E6 is ESP walking a 6-entry chain), varying the amount of threads and MEs used for processing (e.g. 2x8 is 2 MEs, each processing 8 queues).

We have seen in section 4.1 that we could come up with a similar latency for the collect operation for a single thread, and as expected, on a single ME, WASP remains way behind in terms of throughput (see e.g. the 1x8 series). With two MEs doing WASP processing however, we can now slightly outperform ESP. In this case, both ESP and WASP have to commit the packet variables into DRAM. The performance improvement can thus be fully attributed to the reduction of memory accesses during the ESS entry lookup. This is true as long as WASP does not hit the memory bandwidth limit, which happens with W10. Indeed, while WASP requires fewer memory accesses, each memory access transfers twice the amount of bytes per access, theoretically requiring more DRAM bandwidth than ESP to sustain the same number of packets per second.

Rather than reading a full bank (48 bytes, including metadata) while walking the chain, we repeated the experiment with a modified map implementation that only fetches 24 bytes during chain walking, and then issues an extra access to get the remaining 24 bytes once the correct entry is found (the "max" series for W8 and W10 on Fig. 8). This indeed slightly improved the throughput for W8 and W10 (from 924 to 928 kpps and from 829 to 840 kpps respectively), but actually degrades the performance for chains of 6 entries and below. As a potential alternative, we could let the extra 24-byte transfer happen while the VPU continues to process the next instructions, and suspend execution only if the data are still missing when we further advance in the memory bank. This is typically an efficient programming technique on the MEs, and our first estimations suggest that we could achieve up to 877 kpps for W10. It would require, however, a significant revision of our code, since we need to detect and update the partially mapped bank transparently.

In other words, if we were to implement on an IXP network processor both WASP and another function that is more memory-bound (such as IP table lookup or the execution of pre-compiled operations on the ESS such as ESP), it would be preferable to balance the amount of threads we are willing to dedicate to the WASP interpreter over the n available microengines (thus sharing microcode and local store with the other functions) rather than grouping them on a single microengine.
5. Towards Deployment

With latencies below 7µs and throughput up to 1.5 Mpps, WASP can offer an interesting trade-off between flexibility and performance from an end-user point of view. From an operator's perspective, however, the amount of resources required for a guaranteed service level is prohibitive.

While the WASP service model allows partial deployment, and even deployment only on some interfaces through sidekick filter boxes, it still lacks some level of fine-tuning that would allow the operator to decide what amount of resources he is willing to dedicate to WASP traffic, and to protect his router against unusually high demand for WASP processing. This section explores possible adjustments to the proposed implementation that can take advantage of the knowledge of "normal" WASP load to lower the required resources. Any such approach, unless properly protected, becomes vulnerable to attacks where a group of hosts crafts a flow of WASP packets exceeding the available resources, therefore degrading or denying WASP service to other packets.

5.1. Alternate Structures for the ESS

The organisation of the ESS as a hash table, as presented in section 3.4, clearly has the drawback that excessive chain length may degrade performance up to the point that service will be denied. As a first defensive measure, we can salt the hashing function with a local random value and enforce a maximum chain length, so that attackers cannot craft a collection of packets that ends up in the same chain. Still, the total amount of keys we could store in the ESS is limited by the amount of expensive SRAM, which defines the number of chains the ESS can have.
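A minimal sketch of this defensive measure is given below; the salt mixing, the table size and the maximum chain length are illustrative assumptions, not the values used in our prototype.

```c
/* Sketch of the defence described above: the hash slot is derived from the key
 * AND a per-store random salt drawn at boot, and insertions are refused once a
 * chain reaches a maximum length. All constants are illustrative assumptions. */
#include <stdint.h>

#define N_SLOTS       (1u << 16)
#define MAX_CHAIN_LEN 8u

static uint64_t store_salt;          /* local random value, unknown to senders */
static uint8_t  chain_len[N_SLOTS];  /* maintained on insert and cleanup       */

static uint32_t salted_slot(uint64_t key)
{
    uint64_t x = key ^ store_salt;   /* attackers cannot predict collisions    */
    x *= 0x9E3779B97F4A7C15ull;      /* cheap integer mix, a single multiply   */
    return (uint32_t)(x >> 48) & (N_SLOTS - 1);
}

/* Returns the slot to use, or -1 when the chain is already full and the
 * request for a new entry must be denied (see the best-effort ESS of 5.2).   */
static int ess_admit(uint64_t key)
{
    uint32_t s = salted_slot(key);
    if (chain_len[s] >= MAX_CHAIN_LEN)
        return -1;
    chain_len[s]++;
    return (int)s;
}
```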
We considered the alternative of balanced trees for the ESS, as these have a better worst case. The constraints would be that the tree can store 30 million items (actually 29,660,000 items, assuming a top rate of 2,966 kpps and all WASP packets creating an entry for 10 seconds) with a maximal depth of 8 levels, which we experimentally determined as the maximal number of DRAM accesses we can afford per ESS lookup. As a first estimation, this is only possible if nodes have at least a degree of 9, but this could not be arranged so that a node (9 pointers and keys) fits in the transfer registers of a single ME thread.

Another possible option would be to use a forest of N trees that satisfy the constraints mentioned above and to map each key to a single tree without a memory lookup (e.g. through a hash function). As a first estimation, 2^16 B+trees holding at most 4 keys per node would meet the constraints. However, this approach would involve a memory overhead of 90% of the values stored (assuming 32 bytes of value per key, keys alone accounting for 25% overhead), and we expect a difficult implementation of insertions and deletions in a way that keeps the periodic store cleanup lightweight enough (in terms of additional DRAM bandwidth requirements).

The 'forest' approach would not completely solve the case where one of the trees receives significantly more keys than the rest of the forest either, and is thus just a way to allow "longer chains" with the same amount of DRAM requests. It could thus be preferable to stick to the hash table approach, but to operate it fully in DRAM, and to dedicate enough memory (28MB per store) to hold the hash pointers in order to keep all chains short enough, though this still has to be confirmed by additional performance measurements.

5.2. Best-Effort ESS

Rather than engineering the size of the state store to allow every packet to create new state, which requires up to 1.3 GB for a full-duplex Gigabit Ethernet interface, an operator might prefer to estimate, through statistical analysis, how much memory is sufficient to sustain daily traffic. Users would then expect the operator to ensure fairness when the load is increased beyond the level the installed amount of memory can support. When deciding whether a WASP packet should be allowed to create a new entry, it would then be interesting not only to count how many entries have been created by this aggregate in the last τ seconds, but also to consider how long the current packet is.

However, the potential denial of state creation in a router independently of other routers' decisions may be inconvenient for applications deriving e.g. from common trunk identification. Indeed, when creating state in nearby routers, WASP programs have no guarantee that further routers will also accept new state. Still, if she optimistically creates state in routers close to her, the user consumes quota for her aggregate and might see further requests denied later on – just when she gets the chance to create state near the core.

In case two neighbour domains support WASP and have negotiated a "fair rate" of WASP entries created per second, it would be preferable for applications to be notified (e.g. through node environment variables) whether their packets are "in profile" for upstream entry allocation. A WASP program could then avoid creating state nearby unless it is "blessed" by the router and receives the guarantee that it can install state in the next domain as well.

5.3. Optimising through Compilation

Although our interpreter is capable of latencies approaching those of the ESP prototype and of throughput slightly outperforming ESP, we must not forget that WASP would keep the ALU of the microengines almost fully busy, potentially leading to higher power consumption. Moreover, the count and collect programs used in our tests remain relatively short (16 bytes) compared to the longest program allowed in WASP (64 bytes).

We thus repeated throughput measurements with "benchmark" packets of variable length, mixing ALU microbytes and accesses to packet scratchpad variables. It revealed that, although WASP processing time increases linearly with bytecode size, the slope is 7 to 10 times higher than that of the packet forwarding time. In other words, on our IXP2400 setup, a 24-byte program should be placed in a 206-byte packet to ensure wire speed processing, and a 64-byte program should not be found in a packet smaller than 470 bytes. Unfortunately, for another implementation, the ratio between code size and packet size might differ, and thus this cannot be enforced as a global rule for "fair" packets.
Compiling bytecode into native code is a well-known technique used to speed up the execution of bytecode languages which, if applied here, could allow us to sustain a given traffic with reduced CPU power. In the IXP network processor, each microengine has its own control store, capable of storing 4096 instructions (called µwords), and only the XScale core can alter the contents of those control stores.

Our measurements show that, below 1000 µwords, the control store needs 0.25µs per µword written. It must also be noted that reprogramming is only feasible when the ME is halted, which leads to an extra delay of 30µs in our setup. The hand-crafted collect function of the ESP filter, for instance, requires 60 µwords of unique code, and we estimate that a direct translation of WASP microbytes into IXP native code could take up to 100 µwords. Assuming that both the XScale and the ME are ready, the reprogramming of a filter function similar to collect would thus cost between 45 and 55µs. If we suppose that this allows a W:collect packet (3.45µs) to have the processing time of E:count (2.57µs), it still takes 63 packets to amortise the cost of micro-store reprogramming.
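The break-even point follows directly from the figures above, taking the worst-case 55µs reprogramming cost (the explicit division and rounding are ours):

\[
\left\lceil \frac{55\,\mu\mathrm{s}}{3.45\,\mu\mathrm{s} - 2.57\,\mu\mathrm{s}} \right\rceil
= \left\lceil \frac{55}{0.88} \right\rceil = \lceil 62.5 \rceil = 63\ \text{packets}.
\]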
In these circumstances, only very specific traffic patterns could benefit from a just-in-time compilation approach with IXP network processors. Yet, if the node can identify the k most used WASP programs whose sizes do not exceed the free space on the micro-store after compilation, it would theoretically be possible to strongly improve the performance of the node while keeping the ability to process any program and to dynamically adapt to a new set of popular functions.

6. Conclusions

We designed WASP after lessons learnt from former active routers, taking into account the constraints of network processor programming. The result is a scalable virtual processor that can safely interpret user-emitted bytecode on router linecards. This has been achieved at the cost of a restricted programming model that does not allow most of the constructs found in a general-purpose programming language. Yet, WASP bytecode is expressive enough to implement several application-specific network measurement and control protocols.

We have implemented and tested the WASP VPU on the IXP2400 network processor in a "filter box" setup. Under low load, the interpreter is competitive with pre-compiled operations as seen in ESP, and the advantage of larger entries for the state store has been confirmed. The VPU processing however makes more intensive use of the microengines' ALU, and throughput will not scale with the number of active threads as well as it does with ESP. It does however scale well with the number of microengines. This advocates for the integration of both codes (ESP and WASP), as well as run-time-compiled optimisations of frequent WASP programs if any, on the same microengine, which would better balance the ALU usage.

We also observed that the increase of the average chain length in the state store may have an important impact on the forwarding latency of WASP and ESP packets. A mechanism limiting the longest chain will be mandatory to support applications that measure network performance or that try to enforce a given quality of service. The size of the ESS is thus no longer the only key parameter for proper ESS behaviour: the amount of SRAM holding the hash table will define the average chain length and the average latency of ESS accesses.

We initially opted for an interpreter-based solution because the IXP2xxx series, unlike other NPUs [6], appears poorly suited to just-in-time compilation. It is clear however that an IXP2400 barely has enough resources to handle a couple of Gbps flows full of WASP packets. In a production version of WASP, performance (and the number of MEs available for other components) could be strongly improved by a control component that would compile native code chunks for the most frequent functions present in the bytecode. How we can efficiently identify whether we have a native code chunk for a given packet or not is still ongoing work.

Acknowledgements

Sylvain Martin was a Research Fellow of the Belgian National Fund for Scientific Research (FNRS) and was also partially funded by the EU under the ANA FET project (FP6-IST-27489).

References

[1] A. Campbell, S. Chou et al.: "NetBind: A Binding Tool for Constructing Data Paths in Network Processor-Based Routers", in Proc. of OPENARCH'02, June 2002, pp. 91-103.
[2] L. Ruf, K. Farkas, H. Hug and B. Plattner: "Network Services on Service Extensible Routers", in Proc. of IWAN'05, Sophia Antipolis, LNCS 4388, pp. 53-64.
[3] K. Lee and G. Coulson: "Supporting Runtime Reconfiguration on Network Processors", in Proc. of Advanced Information Networking and Applications, Vienna, Austria, April 2006, pp. 721-726.
[4] K. Calvert, J. Griffioen, N. Imam, J. Li: "Challenges in Implementing an ESP Service", in Proc. of IWAN'03, Kyoto, LNCS 2982, pp. 3-19.
[5] J. Moore: "Practical Active Packets", PhD Thesis, University of Pennsylvania, 2002.
[6] A. Kind, R. Plekta and B. Stiller: "The potential of just-in-time compilation in active networks based on network processors", in Proc. of OPENARCH'02, pp. 79-90.
[7] T. Egawa, K. Hino, Y. Hasegawa: "Fast and Secure Packet Processing Environment for Per-Packet QoS Customization", in Proc. of IWAN'01, Philadelphia, USA, October 2001, LNCS 2207, pp. 34-48.
[8] H. Otsuki and T. Egawa: "A Retransmission Control Algorithm for Low-Latency UDP Stream on StreamCode-Base Active Networks", in Proc. of IWAN'03, Kyoto, LNCS 2982, pp. 92-102.
[9] K. Calvert, J. Griffioen and S. Wen: "Lightweight network support for scalable end-to-end services", in Proc. of ACM SIGCOMM, 2002, pp. 265-278.
[10] M. Conrad and H.-J. Hof: "A Generic, Self-organizing, and Distributed Bootstrap Service for Peer-to-Peer Networks", in Proc. of IWSOS'07, The Lake District, UK, September 2007, pp. 59-72.
[11] S. Martin and G. Leduc: "Ephemeral State Assisted Discovery of Peer-to-peer Networks", in Proc. of the 1st IEEE Workshop on Autonomic Communications and Network Management, Munich, May 2007, pp. 9-16.
[12] S. Martin and G. Leduc: "An Active Platform as Middleware for Services and Communities Discovery", in Proc. of ICCS 2005, LNCS 3516 (part 3), pp. 237-245.
[13] P. Gill, M. Arlitt et al.: "The Flattening Internet Topology: Natural Evolution, Unsightly Barnacles or Contrived Collapse?", in Proc. of PAM'08, Cleveland, USA, April 2008, LNCS 4979, pp. 1-10.
[14] J. Lu and J. Wang: "Performance Modeling and Analysis of Web Switches", in Proc. of the 31st International Computer Measurement Group Conference, Dec. 4-9, 2005, Orlando, Florida, USA, pp. 665-672.
[15] S. Wen, J. Griffioen and K. Calvert: "Building Multicast Services from Unicast Forwarding and Ephemeral State", in Proc. of IEEE OPENARCH'01, Anchorage, Alaska, USA, Apr. 2001, pp. 37-48.
[16] I. Wakeman, A. Jeffrey, R. Graves, T. Owen: "Designing a Programming Language for Active Networks", unpublished work presented at the Hipparch workshop, 1998. citeseer.ist.psu.edu/wakeman98designing.html
[17] S. Martin and G. Leduc: "Interpreted Active Packets for Ephemeral State Processing Routers", in Proc. of IWAN'05, Sophia Antipolis, LNCS 4388, pp. 156-167.
[18] D. Wetherall: "Active network vision and reality: lessons from a capsule-based system", Operating Systems Review, vol. 33, ACM, Dec. 1999, pp. 64-79.
[19] S. Bhattacharjee, K. Calvert, E. Zegura: "On Active Networking and Congestion", Technical Report GIT-CC-96/02, Georgia Institute of Technology. ftp://ftp.cc.gatech.edu/pub/coc/tech_reports/1996/GIT-CC-96-02.ps.Z
[20] S. Martin: "WASP: Lightweight Programmable Ephemeral State on Routers to Support End-to-End Applications", Doctoral Thesis, University of Liège, 2007, ISSN 075-9333. ftp://ftp.run.montefiore.ulg.ac.be/pub/RUN-BK07-02.pdf
[21] D. Meng, E. Eduri, M. Castelino: "IXP2400 Intel Network Processor IPv4 Forwarding Benchmark Full Disclosure Report for Gigabit Ethernet, revision 1.0", The Network Processing Forum, March 2003.