
2. Microarchitecture Specification

Pedro Gimenes edited this page Nov 5, 2023 · 1 revision

The following micro-architectural specification details how the features and requirements discussed in Section [section:arch_spec] are achieved at the circuit level.

Overview of Dataflow

As previously discussed, the Node Scoreboard is responsible for allocating resources within the accelerator and driving internal interfaces with all other functional units to perform aggregation and transformation functions. Additionally, there are direct interfaces between the Prefetcher, AGE and FTE to bypass the NSB during data transfer. The main computational steps in a node’s lifetime within the accelerator are listed below (i.e. after layer and global configuration registers have been programmed into the NSB).

  1. NSB → PREF: Request to fetch adjacency list.

  2. NSB → PREF: Request to fetch incoming messages using offsets stored in the Adjacency Queue.

  3. NSB → AGE: Request to aggregate features with the given aggregation function.

  4. AGE → PREF: Request for incoming messages stored in the Message Queue.

  5. PREF → AGE: Response with requested features through the Message Channel.

  6. AGE: When aggregation is complete, place aggregation results into the next available slot in the Aggregation Buffer.

  7. AGE → NSB: Response signalling that aggregation for each Nodeslot is complete.

  8. NSB → FTE: Request transformation once the count of buffered aggregations reaches the configuration parameter.

  9. FTE → PREF: Request layer weights stored in the Weight Bank.

  10. PREF → FTE: Sequence layer weights in the required order for systolic modules within the Transformation Engine.

  11. FTE: Write transformed features back into DRAM through the AXI Write Master interface or into the Transformation Buffer, according to the configuration parameter.

  12. FTE → NSB: Response packet signalling that transformation is complete for each Nodeslot.
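The steps above amount to a fixed phase progression for each Nodeslot, advanced by each unit's done response. The following Python sketch models that progression; the state names are illustrative stand-ins, not identifiers taken from the RTL.

```python
from enum import Enum, auto

class NodeslotState(Enum):
    """Hypothetical phase names for a Nodeslot's lifetime (not RTL names)."""
    FETCH_ADJ_LIST = auto()   # steps 1-2: Prefetcher fetches adjacency list
    FETCH_MESSAGES = auto()   # incoming messages stored in Message Queue
    AGGREGATE = auto()        # steps 3-7: AGE aggregates features
    TRANSFORM = auto()        # steps 8-12: FTE updates embeddings
    DONE = auto()

ORDER = [NodeslotState.FETCH_ADJ_LIST, NodeslotState.FETCH_MESSAGES,
         NodeslotState.AGGREGATE, NodeslotState.TRANSFORM, NodeslotState.DONE]

def next_state(state):
    """Advance one phase after the corresponding unit's done response."""
    idx = ORDER.index(state)
    return ORDER[min(idx + 1, len(ORDER) - 1)]
```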

See Sections [section:node_scoreboard], 0.3, 0.4 and 0.5 for further details on the microarchitecture of each functional unit.

Internal Interconnects

AGILE contains two AXI interconnects, the Memory Interconnect (MI) and the Register Bank Interconnect (RBI). The first has several masters and a single slave, connecting functional units of the design to the DRAM controller IP for memory access. This interconnect has a 34-bit address width and a 512-bit data bus. The RBI has a single master connected to several slaves, to provide access to the register banks within each functional unit of the design. In the simulation environment, the master is connected to the Testbench for stimulus driving. In the physical implementation, it is connected to the Host device through a PCIe/AXI-Lite bridge. The RBI follows the AXI-Lite protocol, which supports only the subset of AXI features required for register programming: the address and data buses are both 32 bits wide, and burst transactions are not supported. In both interconnects, equal priority is assigned to all masters, and a Round-Robin arbitration strategy was chosen to ensure fair access in the event of multiple masters requesting access to the same address space.
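The Round-Robin policy used by both interconnects can be sketched as follows: starting from the most recently granted master, scan forward through the pending-request bitmask and grant the first requester found. This is an illustrative model only; the actual arbitration is internal to the interconnect IP.

```python
def round_robin_grant(requests, last_grant, n):
    """Round-Robin arbitration over n masters.

    requests:   bitmask of masters currently asserting a request
    last_grant: index of the master granted in the previous round
    Returns the index of the next granted master, or None if idle.
    """
    for offset in range(1, n + 1):
        idx = (last_grant + offset) % n
        if requests & (1 << idx):
            return idx
    return None
```

Because the scan always starts one past the previous grant, a master that just finished goes to the back of the queue, which is what gives every master equal priority over time.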

Prefetcher

The Prefetcher is driven through a valid-ready request/response interface with the NSB. The main sub-units within the Prefetcher are its multi-precision Feature Banks and Weight Banks, responsible for fetching and storage of incoming messages and feature update weights, respectively. Three AXI read masters are instantiated within the Prefetcher, for Adjacency List, Messages and Weights fetching. The first two are driven by the Feature Banks (see Section 0.3.2), while the last is driven by the Weight Banks. An AXI-L interface is used to control the Prefetcher’s internal register bank, containing layer configuration and control flags. Finally, the Prefetcher interfaces with the Aggregation Engine (AGE) and Transformation Engine (FTE) through its Message Channels and Weight Channels, respectively.

Weight Bank

Upon a request from the NSB, the Weight Bank fetches the matrix of weights required to run inference on a fully-connected layer. This takes place during the layer configuration phase, before Nodeslot programming. The weights are later used by the Feature Transformation Engine (FTE) to update each node’s embeddings. Each row of the weight matrix is stored in a separate Ultraram FIFO, such that the weights can be flushed in the required order through the Weight Channel for consumption by the systolic arrays in the FTE.

Weight Bank State Machine

As shown in Figure 1, the weight bank cycles between FETCH_REQ, WAIT_RESP and WRITE states while the weights are fetched. Each request to the AXI read master is for a single row in the weights matrix, i.e. up to 1024 features (or 4kB). Each AXI response beat contains up to 16 features, hence the required number of AXI beats is dynamically determined depending on the feature count. After receiving each response beat, the state machine transitions from the WAIT_RESP state to WRITE, where each of the 16 features is pushed into the FIFO over 16 cycles. After storing the last feature in the last expected response beat, the state machine either transitions back to FETCH_REQ (if there are more rows pending) or into WEIGHTS_WAITING. In the latter case, the weight bank waits for a request from the FTE to dump the weights over the Weight Channel.
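Since each 512-bit AXI beat carries up to 16 features (implying 32-bit features, consistent with a 1024-feature row occupying 4kB), the dynamically determined beat count per row is a simple ceiling division. A sketch of that arithmetic:

```python
FEATURES_PER_BEAT = 16   # 512-bit beat / 32-bit feature

def beats_for_row(feature_count):
    """AXI response beats needed to fetch one weight-matrix row
    (rows hold up to 1024 features, i.e. up to 4 kB per request)."""
    assert 1 <= feature_count <= 1024
    return (feature_count + FEATURES_PER_BEAT - 1) // FEATURES_PER_BEAT
```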

In the DUMP_WEIGHTS state, the row FIFOs are pulsed such that weights arrive diagonally in the systolic modules in the FTE (see Section 0.5.1). This is achieved by an instance of the Systolic Module Driver (see Section 0.7.4). During the weights dumping phase, the internal FIFO read pointers for each row are updated, but no data is overwritten in the Ultraram blocks. When transitioning back to the WEIGHTS_WAITING state, the pointers are reset to their original value. As such, the weights are immediately ready to be re-used for updating subsequent nodes in the same layer, without requiring repeated fetching.

Feature Bank

Before a node’s incoming messages can be aggregated, the Prefetcher receives a request from the Node Scoreboard to fetch them from DRAM and store them in a local memory. There are 8 Fetch Tags in each Feature Bank, coupled directly to an Aggregation Manager in the AGE through the Message Channel. All Tags can issue requests concurrently, which are arbitrated through a Round-Robin Arbiter to provide access to the AXI Read Master. Once the Aggregation Engine (AGE) starts requesting feature data, following an aggregation request by the Node Scoreboard (NSB), the features are issued sequentially from the Prefetcher by the corresponding Sequencer.

Prefetcher Microarchitecture for a configuration with 63 Fetch Tags. The AXI Read Master is driven from arbitration of the Request Engine, and the Sequencer reads data from the corresponding Fetch Tag according to the request from the Aggregation Engine.

Each Feature Bank receives requests through the Valid-Ready interface with the Node Scoreboard. First, the NSB requests the Prefetcher to fetch the adjacency list for the given node, while the Nodeslot is in FETCH_NB_LIST state. The payloads include the neighbour count and start address, which are transferred to the request logic for the Address Queue in the associated Fetch Tag. The request logic determines the number of bytes to be requested from DRAM from the starting address, given that each adjacency pointer occupies 4 bytes. For example, if a Nodeslot has NEIGHBOUR_COUNT = 256 at START_ADDRESS = 0x80000000, the Address Queue will be populated with data in the range 0x80000000 → 0x80000400 (i.e. 1kB). The Prefetcher sends a done response for the adjacency list fetch request once either (1) the full list has been stored in the Address Queue, or (2) the Address Queue is full, in which case the remaining message addresses will be fetched once the Nodeslot transitions to the FETCH_NEIGHBOURS state and the incoming messages start being fetched and stored in the Message Queue.
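The request logic's byte-range computation can be reproduced directly from the 4-bytes-per-pointer rule; the worked example below uses the figures from the text.

```python
POINTER_BYTES = 4  # each adjacency pointer occupies 4 bytes

def adjacency_fetch_range(start_address, neighbour_count):
    """Byte range requested from DRAM for a node's adjacency list."""
    total_bytes = neighbour_count * POINTER_BYTES
    return start_address, start_address + total_bytes

# Worked example from the text: 256 neighbours at 0x80000000 -> 1 kB span
start, end = adjacency_fetch_range(0x80000000, 256)
assert (start, end) == (0x80000000, 0x80000400)
```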

After the done response for the adjacency list fetch is sent, the NSB will eventually issue a fetch request for the incoming messages when the Nodeslot transitions to the FETCH_NEIGHBOURS state. The request logic for the Message Queue uses the addresses in the Address Queue to issue requests to the AXI Read Master, along with the incoming messages address programmed in the Prefetcher register bank. Similarly, a fetch done response is sent when either (1) all incoming messages have been stored in the Message Queue, or (2) the Message Queue is full, and subsequent messages will be fetched once the Aggregation Engine starts consuming the features. In case (2), a PARTIAL response is encoded in the status field to alert the NSB that subsequent status updates will be issued by the Prefetcher, signalling when the last incoming messages have actually been stored (this takes place while the Nodeslot is in the AGGR state).

Fetch Tag

The adjacency and message queues in each Fetch Tag were implemented using the UltraRAM blocks in the Ultrascale+ FPGA, such that incoming messages can be easily sequenced to the AGE. As shown in Table [table:fpga_memory] (from Section [section:large_graph_handling] on Large Graph Handling), there are 1280 blocks available, each with a capacity of 288 Kbits, for a total capacity of 47.1MB. The use of these hardened blocks incurs a small LUT cost for the required increment/decrement counters and write/read pointers; however, overall usage is dramatically reduced compared to implementing the memory in LUT elements. The total size of the message queue is upper bounded by a corresponding feature size of 512 with 256 neighbours. This corresponds to 262kB per Tag, or 4MB across all fetch tags, which is well within the memory resource budget. This sizing supports the Pubmed graph, which has 500 features and a maximum node degree of 171, without any requirement for partial streaming.
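A back-of-envelope check of the quoted sizing, under two assumptions not stated in this section: a 16-bit stored feature width (which reproduces the 262kB-per-Tag figure), and 16 Fetch Tags in total across precision banks (which reproduces the 4MB total).

```python
# Message-queue sizing sketch; feature width and total tag count are assumptions.
MAX_FEATURES = 512
MAX_NEIGHBOURS = 256
FEATURE_BYTES = 2            # assumed 16-bit storage format
TOTAL_FETCH_TAGS = 16        # assumed total across all precision banks

per_tag_bytes = MAX_FEATURES * MAX_NEIGHBOURS * FEATURE_BYTES  # 262144 B ~ 262 kB
total_bytes = per_tag_bytes * TOTAL_FETCH_TAGS                 # ~ 4.2 MB
```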

Multi-Precision Support

The primary advantage of AGILE over other GNN accelerators is its support for node-wise multi-precision computation. Within the Prefetcher, there is a parametrizable number of Weight Bank and Feature Bank instances responsible for each supported precision format.

Two-stage arbitration process for access to the Adjacency and Message AXI Read Master from each of the Fetch Tags across all supported precision formats.

As shown in Figure 3, supporting multiple precisions requires a two-stage arbitration process to access the Adjacency and Message AXI Read Masters. The first stage arbitrates among Fetch Tags within each precision block (i.e. Feature Bank), while the second stage arbitrates across precision blocks.

Aggregation Engine

The Aggregation Engine (AGE) is responsible for performing permutation-invariant aggregation functions over all the incoming messages of each node. The AGE receives requests over a direct interface to the Node Scoreboard (NSB) and requests data from the Prefetcher over the Message Channels (MC). Upon receiving a request from the NSB for aggregation of a given Nodeslot, the AGE allocates one of its Aggregation Managers (AGM) and a subset of its Aggregation Cores (AGC) according to the Nodeslot’s numerical representation. After aggregation is complete, the AGE transfers the results to the Aggregation Buffer through one of its Buffer Managers.

Microarchitectural diagram of the Aggregation Engine. NSB requests are sorted into Request Queues in the Scheduler according to Nodeslot precision prior to Aggregation Core allocation. The incoming messages are distributed into the ACs from the Message Channels.

The time taken to aggregate all incoming messages for a node in a graph is a function of the node’s degree since more crowded nodes require a larger number of MAC operations. Previous accelerators have made use of static pipelines with a double buffering mechanism, where incoming messages are loaded into local memory while previously loaded messages are aggregated in a set of SIMD cores. In graphs with high variance in node degree, this leads to a large number of pipeline gaps since fast (isolated) nodes must wait for slow (crowded) nodes to release aggregation resources. To alleviate this, Aggregation Cores are asynchronously allocated by the AGE within the AGC Allocators (Section 0.4.2) following NSB requests.

Aggregation Mesh

One of the initial considerations in the design of the AGE was that the number of accumulators required for node feature aggregation is a function of the input feature count, which is a layer parameter. To achieve low inference latency, it was deemed crucial that the design should not require re-programming of the FPGA between layer passes. Hence to achieve efficient aggregation with runtime-configurable feature counts, the AGE was designed to contain a network of Aggregation Cores (AGC) which are dynamically allocated at run-time (see Section [section:age_agc_allocation] for details). The AGCs are placed in a 2D mesh Network-on-Chip (NoC) topology, where nodes communicate via network packets. The AGE dynamically determines and allocates the required number of AGCs depending on the input feature count. This also enables aggregating nodes from multiple layers simultaneously, removing the requirement for unloading the accelerator after each layer pass.

In summary, the main advantages of the chosen architecture are as follows.

  • Avoid layer re-programming: the required number of AGCs per node operation can be determined and dynamically allocated at run-time, depending on the layer’s input feature count.

  • Asynchronous node aggregation: while previous accelerators in literature have made use of static pipelines with a double buffering mechanism, leading to a large number of pipeline gaps due to the non-uniform distribution of node degrees, AGILE launches node aggregation as soon as enough resources are released, independently of other nodes on the accelerator.

  • Reduced congestion: for each Aggregation and Buffer Manager to interface directly with each AGC requires complex logic that grows exponentially with the number of processing elements. By utilizing the discussed NoC architecture to handle communication between the PEs, the assumed resource growth is linear at the cost of marginally increased aggregation latency owing to packet propagation.

Aggregation Mesh, with routers shown in yellow, Aggregation Cores shown in blue and Buffer Managers shown in purple.

As shown in Figure 5, the lowest row in each Aggregation Mesh is occupied by Aggregation Managers, which are responsible for interfacing with Prefetcher Fetch Tags via the Message Channels and distributing features to the allocated set of AGCs. Additionally, the right-most column is occupied by Buffer Managers. Once an AGM has finished distributing features to AGCs, it sends a packet to each AGC instructing it to send its own features to the allocated Buffer Manager. The BM is then responsible for storing the received features in the Aggregation Buffer.

The AGE uses an open-source network router implementation provided by Galimberti, Testa and Zeni from Politecnico di Milano, which supports a Wormhole Switching architecture. Each packet is comprised of a Head flit, 0 or more Body flits, and a Tail flit. The head flit contains routing information such as source and destination node coordinates, while the tail flit may include data and/or “housekeeping” payloads to close the connection between two nodes. Each router is comprised of 5 ports (LOCAL, NORTH, EAST, SOUTH, WEST), each of which contains an input buffer which can only be allocated to a single packet. After receiving a head flit, the receiving port applies backpressure to all other ports, which must wait until the given packet is drained (i.e. the tail flit is forwarded). Galimberti, Testa, and Zeni’s implementation additionally supports a parametrizable number of Virtual Channels to increase network throughput; however, only a single VC was used in the Aggregation Mesh to reduce resource usage.

The routers in the Aggregation Mesh utilize the Dimension Order Routing algorithm, meaning that the destination port for each received flit is dynamically determined to first align the X coordinates of the incoming packet, then the Y coordinates.
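Dimension Order (XY) routing can be sketched as below. The mapping of coordinate direction to port name (e.g. that increasing x means EAST and increasing y means NORTH) is an assumption for illustration; the routing discipline itself, X first then Y, is what the text specifies.

```python
def dor_route(cur, dst):
    """Dimension Order Routing: align the X coordinate first, then Y.
    cur and dst are (x, y) router coordinates; returns the output port.
    Direction-to-coordinate mapping is assumed, not taken from the RTL."""
    cx, cy = cur
    dx, dy = dst
    if cx < dx:
        return "EAST"
    if cx > dx:
        return "WEST"
    if cy < dy:
        return "NORTH"
    if cy > dy:
        return "SOUTH"
    return "LOCAL"   # packet has arrived at its destination router
```

Because every flit of a packet follows the same deterministic path, this routing scheme preserves flit order within a packet, the property the Aggregation Cores rely on when applying features to their accumulators in order.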

Aggregation Core Allocation

The Aggregation Core Allocator sits at the frontend of the Aggregation Mesh, receiving NSB requests which are demultiplexed by the AGE according to precision. The Allocator has visibility of the mask of free AGCs, constructed from the concatenation of their individual free flags. During the design, a race condition was identified between the Allocator and the AGCs, due to the time taken for allocation packets to propagate through the network. To overcome this, the Allocator keeps an internal mask register of allocatable cores, which is updated each time an AGC is allocated to a Nodeslot. This accounts for the case in which allocation for a new Nodeslot of the same precision is requested before the AGCs assert their allocation status flag. The allocatable cores mask is also updated when a deallocation pulse is received by the Allocator, which takes place asynchronously to the allocation process, when the AGM sends its done response to the NSB.

After allocation, the Allocator issues an aggregation request to the required AGM, which encapsulates the original NSB request along with the coordinates of the allocated AGCs. These coordinates are contained in up to 64 allocation slots, since this is the maximum number of required AGCs for each Nodeslot, assuming a maximum feature count of 1024. An AGC count is also included in the request such that the AGM can consume the first N coordinate slots, where $N = \frac{\text{feature count}}{16}$.

AGC allocation within the Allocator takes place through a sequential round-robin mechanism. First, the required number of AGCs N is determined based on the layer feature count. Over the subsequent N cycles, the round-robin arbiter grants access to an available AGC, and the allocatable AGC mask is updated. Simultaneously, the X and Y coordinates are determined according to the mesh dimensions set at compile time.
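The allocation loop can be sketched as follows. For simplicity the sketch grants the lowest free index each cycle, standing in for the round-robin arbiter, and the mapping from core index to mesh (x, y) position is an assumption.

```python
def allocate_agcs(alloc_mask, feature_count, mesh_width):
    """Allocate N = feature_count/16 AGCs from the allocatable-core mask,
    one grant per cycle in hardware. Returns the allocation slot
    coordinates and the updated mask. Lowest-index scan stands in for
    the round-robin arbiter; index->(x, y) mapping is assumed."""
    n = feature_count // 16           # 16 accumulators per AGC
    slots = []
    for core in range(alloc_mask.bit_length()):
        if len(slots) == n:
            break
        if alloc_mask & (1 << core):
            alloc_mask &= ~(1 << core)               # mark core unallocatable
            slots.append((core % mesh_width, core // mesh_width))
    return slots, alloc_mask
```

Updating the internal mask at grant time, rather than waiting for the AGC status flags, is exactly what closes the race condition described above.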

During implementation, an alternative approach was considered to extend the utilized Round-Robin Arbiter design to a multi-grant use case, such that all required AGCs could be allocated in a single cycle. This was left as future work to potentially reduce aggregation latency.

Aggregation Manager

The AGM’s primary function is to drive the Message Channel interface with the Fetch Tags in the Prefetcher and transfer received features and scale factors as network packets to the AGCs. Messages are stored in the Fetch Tags at the granularity of AXI beats (512b), meaning 16 features are stored per Queue element, which matches the number of accumulators in each AGC. The number of flits required to send a 16-feature block was analyzed during design space exploration, since higher payload data widths lead to lower aggregation latency at the cost of higher resource usage in the input port block buffers. After experimentation, the payload width was set to 64b, meaning 2 features are sent per flit, or 8 flits per AGC packet.
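The flit-packing arithmetic behind the chosen design point follows directly from the quoted widths (a 64-bit payload carrying two features implies 32-bit features, consistent with the 16-feature 512-bit beat):

```python
PAYLOAD_BITS = 64            # chosen NoC flit payload width
FEATURE_BITS = 32            # implied by 2 features per flit
FEATURES_PER_BLOCK = 16      # one AXI beat / one AGC's accumulators

FEATURES_PER_FLIT = PAYLOAD_BITS // FEATURE_BITS          # 2
FLITS_PER_PACKET = FEATURES_PER_BLOCK // FEATURES_PER_FLIT  # 8 flits per AGC packet
```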

Aggregation Manager state machine

As shown in Figure 6, the AGM begins by sending a 2-flit allocation packet to each AGC granted by the AGC allocator, containing the Nodeslot and required aggregation function. After the Message Channel request is accepted by the Fetch Tag, the AGM cycles between the WAIT_PREF_RESP and SEND_AGC states. When transitioning to SEND_AGC, an auxiliary counter is triggered to ensure the correct features are selected from the message beat. The head flit in the feature packet contains the scale factor which is multiplied with every subsequent feature before aggregation. This is required for both GCN and GAT network architectures, where each feature vector is multiplied by degree and attention scalars respectively. An internal pointer is updated when transitioning away from the SEND_AGC state to ensure destination coordinates are selected from the correct allocation slot in the next packet.

When the Message Channel response has its last flag asserted, indicating the Message Queue is drained, the AGM transitions to the WAIT_BUFF_MAN_ALLOC state. This triggers the AGE to allocate a Buffer Manager. When one becomes available, the AGM transitions to the BUFF_MAN_ALLOC_PKT state, where a request packet is sent to each of the allocated AGCs containing the allocated BM coordinates. The AGM then remains idle for a period of time while the AGCs send aggregated features to the BM for buffering. When the AGM eventually receives a done packet from the BM, an NSB response is generated. Although the NSB response at the AGE interface is valid-only, backpressure is allowed at the AGM interface since the AGE needs to arbitrate among all AGMs sending response signals simultaneously.

Due to the Wormhole Switching mechanism of the NoC routers, the AGM is required to wait for the input buffer on the local port to be drained of any flits before being freed. This ensures no conflicts when the AGM is allocated to a new Nodeslot, and no latency is added in the typical use case since most flits are drained while the NSB response pulse is being arbitrated by the AGE.

Aggregation Core

Each Aggregation Core starts operating after receiving an allocation packet from an AGM, containing the allocated Nodeslot and aggregation function. The AGC contains 16 feature aggregators, which are updated with incoming features in the order they are received from the network. This relies on the assumption that order is preserved for flits within the same packet across the network, which is maintained for the described topology. An internal counter is updated after each incoming flit is received, which is used to drive the required feature aggregators in order. After receiving each feature, the AGC decodes the source node coordinate to reject any packets not originating from the allocated AGM. This acts to reduce computation error in the event of packet misdirection within the network.

Aggregation Core state machine

As shown in Figure 7, when the last_packet flag is asserted in the head flit of any incoming feature packet, this signals that the AGM has finished draining the incoming messages in the Fetch Tag, so the AGC proceeds to the WAIT_BUFFER_REQ state until a buffering request packet is received. This then triggers an internal sent_flits counter as all aggregated features are transferred to the allocated Buffer Manager, at the chosen granularity of 2 features per flit. The AGC is freed as soon as the local port’s input buffer is drained, without any further communication with the AGM required.

Each feature aggregator within the AGC contains a scale factor multiplier and a number of aggregator modules. AGILE supports user-defined aggregation functions, which can be defined using HLS or RTL modules and integrated within the feature aggregator wrapper module using a provided script. User-defined aggregators can be combinational or multi-cycle but must follow the specified interface, where input features are driven with a valid-only protocol, i.e. no back-pressure is enabled. Required aggregators can be defined at compile time in a specified JSON file which is consumed by the build script. The base variant contains a sum aggregator, which can also support mean aggregation by subsequently pulsing a drive_division port in the final state of the AGC state machine. Finally, a “passthrough” aggregator is used during the first packet to reduce aggregation latency in case of multi-cycle aggregation functions.

Buffer Manager

After being allocated to a Nodeslot by the AGE, the Buffer Manager’s primary role is to receive feature packets from associated AGCs and store these in the correct range within the Aggregation Buffer. Due to the topology of the network, feature packets are not necessarily received in the order in which they should be stored, since AGCs may be allocated in any mesh position depending on runtime state. As such, each Buffer Manager requires knowledge of the order in which AGC packets should be stored.

When the AGE couples a Buffer Manager to one of the AGMs, the allocation slot coordinates are transferred and registered into the BM. The BM then keeps a count of received flits from each AGC in the coordinate array. When addressing the buffer slot, the most significant address bits (i.e. AGC offset) correspond to the relative position of the AGC co-ordinates in the allocation slot list, while the least significant bits (i.e. feature offset) correspond to the received flit count for the incoming flit’s source node.
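The buffer addressing scheme above can be sketched as follows; the field widths and the helper name are illustrative, but the split, AGC offset in the MSBs and per-source flit count in the LSBs, is as described.

```python
def buffer_address(src_coord, alloc_slots, flit_counts, feature_offset_bits):
    """Aggregation Buffer write address for an incoming flit.

    src_coord:           (x, y) of the sending AGC, decoded from the head flit
    alloc_slots:         ordered list of allocated AGC coordinates (from the AGM)
    flit_counts:         received-flit counter per source AGC
    feature_offset_bits: width of the feature-offset (LSB) field, assumed here
    """
    agc_offset = alloc_slots.index(src_coord)      # MSBs: position in slot list
    feature_offset = flit_counts[src_coord]        # LSBs: flits seen so far
    return (agc_offset << feature_offset_bits) | feature_offset
```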

Buffer Manager state machine. A done response packet is sent to the associated AGM in SEND_DONE state after all features have been buffered. The BM is deallocated when the transformation is complete and the buffer slot is freed.

As shown in Figure 8, the BM cycles between WAIT_FEATURES and WRITE states while packets are being received from the AGCs. The exit condition is given by an agc_done mask, which is constructed as follows: first, a bitwise mask of valid allocation slots is constructed by subtracting 1 from the one-hot representation of the binary AGC count. Valid allocation slots are then considered done when their received flit counter matches the expectation, while the non-valid slots are tied to 1.
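The agc_done construction can be reproduced in a few lines; the 64-slot width matches the maximum AGC allocation described earlier, and the exit condition is the mask being all ones.

```python
def agc_done_mask(agc_count, flit_counters, expected, total_slots=64):
    """BM exit condition: valid slots (low agc_count bits) are done when
    their flit counter matches the expectation; non-valid slots tie to 1."""
    valid = (1 << agc_count) - 1       # one-hot(agc_count) minus 1
    done = 0
    for i in range(total_slots):
        if not (valid >> i) & 1:
            done |= 1 << i             # non-valid slot tied high
        elif flit_counters[i] == expected:
            done |= 1 << i             # valid slot finished buffering
    return done
```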

Each Buffer Manager is directly coupled to an Aggregation Buffer slot, hence after feature buffering is complete, the BM remains in WAIT_TRANSFORMATION state (i.e. allocated to its Nodeslot) until the feature count in the associated slot drops to 0. This indicates the FTE is done driving the aggregated features through the Systolic Array.

It should be noted that due to the Wormhole Switching mechanism, flits from different AGCs cannot be interleaved while reaching the BM since each Virtual Channel in the router can only be allocated to a single packet at a time. As such, a potential improvement to the BM would be to infer the feature offset for each AGC from a payload field in the head flit of each packet. This would remove the requirement for storing the coordinates of each allocation slot, reducing register usage by approximately 88 bytes per BM in the default configuration. This was left as potential future work.

Multi-Precision Support

The AGE instantiates a number of Aggregation Meshes, parametrizable at compile time, according to the number of supported precision formats. Each mesh is isolated from the others, so packets cannot be transferred across precision boundaries. Aggregation requests from the NSB are routed to the appropriate mesh according to the node’s precision. However, the arbitration for access to the NSB response interface assigns equal priority to all Aggregation Managers across all precision blocks.

Feature Transformation Engine (FTE)

The Feature Transformation Engine is responsible for fetching aggregated neighbour embeddings from the Aggregation Buffer and multiplying them with a matrix of learned parameters (received from the Weight Bank in the Prefetcher) to generate the updated feature embedding for each Nodeslot. The FTE starts working after a wakeup request from the Node Scoreboard (NSB), followed by an instruction to process the features present in the Aggregation Buffer. This request takes place when the number of available features matches the NSB_CONFIG_TRANSFORMATION_WAIT_COUNT parameter in the NSB configuration. After computation, the FTE stores the results in the Transformation Buffer and/or writes them back to memory through its AXI Master interface.

Fast matrix multiplication is achieved in the FTE through an array of Systolic Modules (see Section 0.7.3). The FTE supports the multi-precision targets described in Section [section:introduction] by dynamically allocating NSB requests to a number of Transformation Cores according to their precision. Each Transformation Core interfaces directly with its dedicated Aggregation Buffer. Finally, the transformation wait count parameter enables run-time control of the latency/power trade-off: lower settings yield lower latency, since nodes are computed immediately without waiting, at the cost of propagating the weights through the array a greater number of times.

Transformation Core

Microarchitecture of the Feature Transformation Engine. Following a request from the NSB, the FTE pulses the Aggregation Buffer FIFOs based on the captured snapshot. The Gating Logic shuts down non-required Systolic Modules (SMs) based on the output feature count. Weights are sequenced into the SMs from the corresponding Weight Channels (WC), which are coupled to the Prefetcher’s Weight Bank.

When the FTE is idle and a request is received from the NSB, indicating that the configured number of aggregated features is available, the mask of valid slots from the Aggregation Buffer is captured into a snapshot register. The snapshot is then used by the control logic to sequence feature data in the required pattern for multiplication in the systolic array. The snapshot mechanism is used so that, when the TRANSFORMATION_WAIT_COUNT parameter is lower than the slot capacity of the Aggregation Buffer, the FTE can fetch aggregations from the correct offsets while the AGE continues to populate the buffer and change the contents of the available slots mask.

State machine of the feature transformation core

Matrix multiplication between weights and aggregated features is achieved in the FTE through up to 64 16 × 16 systolic modules. Weight data is received through the weight channels, each of which is linked to one of the Fetch Tags in the Prefetcher’s Weight Bank. The weights are sequenced “from below" into each of the systolic modules at the required timing. The FTE instantiates a Systolic Module Driver (see Section 0.7.4) to pulse the Aggregation Buffer at the required timing to sequence features into the systolic module.

The dimensions of the Systolic Module are chosen based on the following considerations. As discussed previously, setting the TRANSFORMATION_WAIT_COUNT parameter to half the slot capacity of the Aggregation Buffer is equivalent to a Double Buffering arrangement between the AGE and FTE. Although the Host application can set this parameter low for lower latency on single-node computation, it will be set to 16 in the common use case. This reflects the assumption that at runtime, approximately $\frac{1}{4}$ of the 64 Nodeslots will be present in the Transformation Engine, with the remainder spread across the Prefetch and Aggregation phases. Hence, the height of the systolic arrays is sized to 16 to support the common use case of double buffering between the AGE and FTE.

Equation [equation:transformation_matrices] expresses the matrix dimensions of the multiplication, where X̂ is the updated feature matrix and W is the matrix of learned parameters. The number of columns in the result matrix is determined by the number of output features for each GNN layer; hence, the total number of columns of Processing Elements is determined by the maximum supported output feature count. This is set to 1024, although it is parametrizable at compile time.

X̂ = X · W = (WAIT_COUNT × IN_FEATURES) × (IN_FEATURES × OUT_FEATURES) = (WAIT_COUNT × OUT_FEATURES)
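A quick dimensional sanity check of this multiplication (a sketch with illustrative feature counts; the real values depend on the layer configuration):

```python
# WAIT_COUNT matches the common-case configuration of 16; the feature
# counts below are small illustrative values, not real layer sizes.
WAIT_COUNT, IN_FEATURES, OUT_FEATURES = 16, 8, 4

def matmul_shape(a_shape, b_shape):
    """Shape of A @ B, checking the inner dimensions agree."""
    rows_a, cols_a = a_shape
    rows_b, cols_b = b_shape
    assert cols_a == rows_b, "inner dimensions must match"
    return (rows_a, cols_b)

# X is (WAIT_COUNT x IN_FEATURES), W is (IN_FEATURES x OUT_FEATURES)
x_hat_shape = matmul_shape((WAIT_COUNT, IN_FEATURES),
                           (IN_FEATURES, OUT_FEATURES))
assert x_hat_shape == (WAIT_COUNT, OUT_FEATURES)
```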

To support Message-Passing Networks such as GAT or GIN, the FTE may be required to store transformation results in the Transformation Buffer, to be used by downstream logic to compute outgoing messages. For simpler networks such as GCN, a node's outgoing message is simply its feature embedding, so the FTE stores transformation results directly back to DDR memory. This reduces computation latency, since the node skips the later pipeline stages it does not require.

The buffering and writeback logic in the FTE is achieved through a shifting mechanism. To support large feature counts, the FTE contains up to 16,384 Processing Elements, making it infeasible to connect each row of PEs to any arbitrary buffer slot. After running synthesis on a fully connected architecture, it was found that the incurred multiplexers are too combinationally deep to satisfy timing constraints at the desired frequency of 200MHz. This complexity was reduced by reading values from only the first row of PEs. After the first row is buffered, each row is shifted up using the overwrite mechanism in the MAC units (see Section 0.7.3).
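The drain-and-shift writeback can be sketched as follows (a behavioural Python model under the assumption that shifted-out rows are replaced with zeros; the function name is illustrative):

```python
# Sketch of the FTE writeback: only the first (top) row of the PE grid is
# wired to the buffer; after each row is drained, every row's accumulator
# is overwritten with the row below it via the MACs' shift mechanism.

def drain_pe_grid(accumulators):
    """accumulators: list of PE rows; returns rows in top-to-bottom order."""
    rows_out = []
    grid = [row[:] for row in accumulators]
    for _ in range(len(grid)):
        rows_out.append(grid[0][:])             # buffer the top row only
        grid = grid[1:] + [[0] * len(grid[0])]  # shift every row up
    return rows_out
```

Each iteration models one buffer write plus one shift pulse, so an n-row grid is written back in n steps without any wide row-to-slot multiplexing.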

Clock Domain Crossing at Register Bank Boundary

While running synthesis on the finished design, it was found that the register bank in the NSB could not meet timing constraints at the desired frequency of 200MHz, due to the combinationally deep logic involved in multiplexing AXI transactions onto the large number of registers contained in the Node Scoreboard. As discussed in Section [subsection:airhdl], this portion of the circuit is autogenerated by the AirHDL tool according to a JSON description of the required configuration registers. An initial attempt was made to reduce the complexity by reducing the number of Nodeslots to 32. This still did not meet timing, and additionally increased latency, since fewer nodes could be pre-programmed while the accelerator was busy. The solution to the timing issue was to run the register bank circuit at a lower frequency of 50MHz while maintaining the remainder of the accelerator at 200MHz. This was achieved by introducing the multi-flop synchronization circuit shown in Figure 11 around each register in the register bank, via an automation script that adds a wrapper around the autogenerated AirHDL code when building the register banks.

Double-flop synchronization circuit for data transfer across a clock boundary, comprised of two register stages clocked by the destination clock, with their data input tied to the data registered on the source clock.

The circuit in Figure 11 transfers data across the clock boundary with a reduced risk of metastability or data incoherency. Due to the difference in frequency, the setup constraints of the B register may be violated when data arrives, which would cause metastability as the register settles at the intended value, resulting in incorrect data being read by downstream logic. This risk is reduced by introducing the C register: B is likely to have settled at the intended value by the time C registers its value, increasing the Mean Time Between Failures (MTBF). Most registers in the NSB register bank are Read-Write from the software perspective, meaning data is transferred from the slow (50MHz) to the fast (200MHz) clock. Some registers are read-only, meaning data is written by the accelerator on the fast clock and read by software via the memory-mapped AXI-L Interface on the slow clock. In the latter case, the same circuit is utilized, although the source clock is at 200MHz (as shown in blue) while the destination clock is at 50MHz.
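The intent of the double-flop stage can be illustrated with a small behavioural model (a sketch only: real metastability is an analog effect that a functional model cannot capture; the class name is illustrative):

```python
# Behavioral model of the double-flop synchronizer: flops B and C are both
# clocked by the destination clock, so a value launched in the source
# domain appears at the output two destination-clock edges later, giving
# flop B a full destination cycle to settle before C samples it.

class DoubleFlopSync:
    def __init__(self):
        self.b = 0  # first destination-domain flop (may go metastable in HW)
        self.c = 0  # second flop; its settled output feeds downstream logic

    def dest_clk_edge(self, src_reg):
        self.c = self.b   # C captures B's previous (settled) value
        self.b = src_reg  # B samples the source-domain register
        return self.c     # downstream logic only ever sees C
```

After the source register changes, the output holds its old value for one destination edge and takes the new value on the second, which is the two-cycle latency traded for reliability.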

Library Components

During the implementation of the design, the need for a library of basic RTL components containing frequently-used functions became apparent. In some cases, there was no readily-available Xilinx IP to achieve the desired functionality; in others, the requirement for finer-grained control of timing and performance called for custom implementations of the following base units.

AXI Read Master

The AXI Read Master is responsible for driving the AR (Read Address) and R (Read Data) channels of the AXI interface according to a user request. The upstream component defines the desired start address and byte count through a valid-ready request interface. The Read Master uses burst functionality to fetch the required data with the minimum number of AXI transactions. As such, the required beat count is dynamically determined and the AXI fields are driven appropriately. As defined by the AXI protocol, a single transaction cannot cross a 4kB boundary, so in this event the Read Master partitions the request into the required number of transactions before cycling through its internal state machine. Finally, the response beats are passed through to the downstream logic through a valid-ready response interface.
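The partitioning step can be sketched as follows (a Python model under assumed parameters: 64-byte beats for a 512-bit data bus and the AXI4 INCR limit of 256 beats per burst; the actual module's bus width may differ, and the function name is illustrative):

```python
# Sketch of AXI read-request partitioning: each emitted burst respects the
# 4 KiB boundary rule and the 256-beat AXI4 burst-length limit. Assumes the
# start address is beat-aligned.

BEAT_BYTES = 64          # assumed 512-bit data bus
MAX_BEATS = 256          # AXI4 INCR burst-length limit
BOUNDARY = 4096          # bursts must not cross a 4 KiB boundary

def partition_read(addr, byte_count):
    """Split (addr, byte_count) into a list of (addr, beats) transactions."""
    bursts = []
    end = addr + byte_count
    while addr < end:
        to_boundary = BOUNDARY - (addr % BOUNDARY)
        chunk = min(end - addr, to_boundary, MAX_BEATS * BEAT_BYTES)
        beats = -(-chunk // BEAT_BYTES)   # round up to whole beats
        bursts.append((addr, beats))
        addr += beats * BEAT_BYTES
    return bursts
```

For instance, a 256-byte request starting 128 bytes before a 4 KiB boundary is split into two 2-beat bursts, one on each side of the boundary.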

Hybrid Buffer

As discussed in Section 0.4, the Buffer Managers in the Aggregation Engine receive aggregated features in a non-deterministic order due to the topology of the AGC mesh; the FTE, however, consumes aggregated features in order. This requirement highlighted the need for a buffer implementation that behaves as an addressable RAM on the write interface but as a FIFO on the read interface. This was achieved by the Hybrid Buffer, which contains a parametrizable number of buffer slots. Each buffer slot was implemented using BRAM blocks to reduce LUT and Flip-Flop usage on the FPGA.

FIFO implementation using dual-port BRAM blocks. The red arrows show the write pointer following several push pulses, assuming a write width of 32B. The black arrows show the read pointer following several pop pulses, assuming a read width of 16B.

Additionally, dual-port BRAMs were used, enabling different widths and depths on the write and read interfaces. Since packet flits in the AGE mesh are 64 bits wide (i.e. containing 2 features each), these can be written directly into the Aggregation Buffer without the need for unpacking, reducing aggregation latency. The features can then be read by the FTE at a 32-bit granularity through the Systolic Module Driver (see Section 0.7.4).
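The asymmetric-width behaviour of one buffer slot can be sketched as follows (a Python model, not the BRAM primitive; the class name and depth are illustrative, while the 8-byte write / 4-byte read widths mirror the 64-bit flits and 32-bit reads described above):

```python
# Sketch of one Hybrid Buffer slot: random-access writes of one 64-bit
# flit (two 32-bit features) and in-order FIFO reads of one 32-bit
# feature, mimicking the dual-port BRAM's asymmetric port widths.

class HybridBufferSlot:
    WRITE_BYTES, READ_BYTES = 8, 4

    def __init__(self, depth_bytes=64):
        self.mem = bytearray(depth_bytes)
        self.rd_ptr = 0  # FIFO read pointer (bytes)

    def write(self, word_addr, flit):
        # Addressable write port: any 64-bit word, in any order
        off = word_addr * self.WRITE_BYTES
        self.mem[off:off + self.WRITE_BYTES] = flit

    def pop(self):
        # FIFO read port: sequential 32-bit reads
        data = bytes(self.mem[self.rd_ptr:self.rd_ptr + self.READ_BYTES])
        self.rd_ptr += self.READ_BYTES
        return data
```

Writes may arrive out of order (as they do from the AGC mesh), yet pops always return the features in address order.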

Systolic Module

The Systolic Array is a widely used architecture for evaluating the core matrix multiplication. This operation is achieved by the Systolic Module, which consists of a square grid of Processing Elements (PEs). Each PE consists of a Multiply-Accumulate (MAC) unit, which takes input features from the North and West PEs. Additionally, the incoming features are propagated "forward" and "downward" to the East and South PEs, respectively. The matrix dimension n is parametrizable at compile time, as is the arithmetic precision of the operands. See Figure 13 for an illustration of the required sequencing of matrix values, such that each PE contains a value of the resulting matrix after all features have propagated through the module.

4 × 4 systolic module. Each Processing Element takes two inputs, A and B. In each cycle, the inputs are propagated "forward" and "downward" to subsequent PEs. Additionally, a MAC core computes the product of A and B and adds the result to an accumulator register in the PE.

The MAC units in each PE perform the multiply-accumulate operation over 2 cycles, using Xilinx floating-point adders and multipliers. A register is placed after the multiplication stage to meet timing constraints, enabling the systolic module to be operated at 200MHz. In addition to the accumulators, each PE contains a bias adder and an activation core to fully perform the computation required for a Fully-Connected Neural Network layer. After driving the input matrix values, the upstream logic can pulse a bias_valid and then an activation_valid signal to overwrite the accumulator contents with the added bias and activated features, respectively. The activation core supports ReLU and LeakyReLU activations. Finally, the upstream logic can pulse a shift_valid signal to overwrite the accumulators with arbitrary data. This is used by the FTE during the writeback stage, as discussed in Section 0.5.
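The dataflow of the module can be sketched as a cycle-level Python model (an illustrative simulation under assumed skewing, not the RTL; it omits the two-cycle MAC pipeline, bias, and activation stages described above):

```python
# Cycle-level sketch of an n x n output-stationary systolic module:
# row i of A enters from the west delayed by i cycles, column j of B
# enters from the north delayed by j cycles; each PE multiplies its two
# inputs, accumulates, and forwards A east and B south every cycle.

def systolic_matmul(A, B):
    n = len(A)
    acc = [[0.0] * n for _ in range(n)]    # per-PE accumulators
    a_reg = [[0.0] * n for _ in range(n)]  # A value held in each PE
    b_reg = [[0.0] * n for _ in range(n)]  # B value held in each PE
    for cycle in range(3 * n - 2):         # enough cycles for full drain
        # Propagate east/south (high index first, so pre-update values move)
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject skewed edge inputs: row i of A lags i cycles, col j of B lags j
        for i in range(n):
            k = cycle - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0.0
        for j in range(n):
            k = cycle - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0.0
        # Every PE multiply-accumulates its current pair of inputs
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

After 3n − 2 cycles, PE(i, j) holds the (i, j) entry of the product, matching the behaviour illustrated in Figure 13.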

Systolic Module Driver

The Systolic Module Driver generates the pulse signals required to drive the read interface of the Hybrid Buffer, such that data is made available with the timing required by the processing elements of a systolic module. This is achieved through a shift register of size HYBRID_BUFFER_SLOT_COUNT. After receiving a start pulse, the least significant bit is set to 1. The register then shifts on every shift pulse, up to a runtime-parametrizable pulse limit (set to the number of output features for the layer being executed). The driver then pulses a further HYBRID_BUFFER_SLOT_COUNT times until the register is flushed.
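This pulse pattern can be sketched as follows (a Python model under the assumption that 1s are shifted in at the LSB until the pulse limit is reached, then 0s until the register is flushed; the function name is illustrative):

```python
# Sketch of the Systolic Module Driver's shift register: each set bit
# enables one Hybrid Buffer slot's read port for that cycle, so slot i
# starts being read i cycles after slot 0, producing the staircase timing
# a systolic module expects.

def driver_masks(slot_count, pulse_limit):
    """Return the per-cycle slot-enable masks (lists of 0/1 bits)."""
    reg = [0] * slot_count
    masks = []
    for cycle in range(pulse_limit + slot_count):
        serial_in = 1 if cycle < pulse_limit else 0  # 1s until the limit
        reg = [serial_in] + reg[:-1]                 # shift towards the MSB
        masks.append(reg[:])
    return masks
```

Each slot ends up enabled for exactly pulse_limit cycles, offset by its slot index, and the register is empty after the final flush pulses.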