rtm_example.c

rtm_example.c
¶ This file implements a parallel version of Conway’s Game of Life.
¶ It demonstrates the structure of a MIMD task-parallel program written for the Rigel architecture using the Rigel Task Model, a Bulk-Synchronous Parallel programming model implemented by a concurrent, hierarchical task queue.
¶ It also demonstrates several useful idioms that take advantage of Rigel- and RigelSim-specific features.
¶ Includes For `fprintf`	#include <stdio.h>
¶ For `malloc`, `strtoul`, etc.	#include <stdlib.h>
¶ For `bool` (C99)	#include <stdbool.h>
¶ For Rigel-specific syscalls, inline ASM, and typedefs	#include "rigel.h"
¶ For RTM typedefs and function declarations	#include "rigel-tasks.h"
¶ For convenience RTM-related macro definitions like `RIGEL_LOOP`	#include "../../benchmarks/common/common.h"
¶ For `ALIGNED_MALLOC`	#include "../../benchmarks/common/macros.h"
¶ Front Matter Global Data
¶ Turn this on and off to enable/disable printouts of the Game of Life board every iteration	#define DEBUG #undef DEBUG
¶ This macro declares a global variable called `flag` that will be initialized to ‘0’ (clear). All threads except 0 will spin on this flag, and thread 0 will set the flag to 1 when it has completed initialization, and all threads will enter the parallel section.	ATOMICFLAG_INIT_CLEAR(flag);
¶ RTM allows for multiple task queues to coexist, and you supply an argument to many RTM function calls specifying the queue ID. in this case, hardcode that argument to 0.	const int QID = 0;
¶ Global variables are unfortunately (from a software enginering perspective) the easiest way to make values writable from thread 0 during initialization and readable from all other threads afterwards. As you’ll see later, you need to be careful about managing the L1 and L2 cache values of these global variables once thread 0 is done writing them and wants to make them globally visible.
¶ These two variables store the number of rows and columns in our Game of Life grid.	int R, C;
¶ These two variables define the size of the block of output cells handled by each task. This could easily be set dynamically.	int BLOCK_SIZE_R, BLOCK_SIZE_C;
¶ Number of iterations of Game of Life	int NUM_ITERATIONS;
¶ Two arrays for doing out-of-place iterations
¶ Could be bitvectors, but this makes the indexing easier and allows you to store additional per-cell info (like object number)	typedef int gol_cell_t; gol_cell_t WORLD_IN, WORLD_OUT;
¶ Define a helper macro for indexing into a 2D packed array. The `+2` is due to the 0-padding on the edges	#define IDX(i, j) ((j)+((i)*(C+2)))
¶ Task-Centric Memory Model Handling
¶ The mechanics of using the task-centric memory model without compiler or runtime support are a bit complex, but we have some helpers to make things easier.
¶ Bools that tell us what TCMM policy we’re using. Only one `TCMM_INV_` and one `TCMM_WRB_` should be set to 1 at a time.	int TCMM_INV_LAZY = 0; int TCMM_INV_EAGER = 0; int TCMM_WRB_LAZY = 0; int TCMM_WRB_EAGER = 0; inline void get_task_parameters(rtm_range_t td, gol_cell_t in, gol_cell_t out, int rmin, int rmax, int cmin, int *cmax) {
¶ Extract the `in` and `out` pointers from `v1` and `v2` in the task descriptor.	in = (gol_cell_t )(td->v1.v1); out = (gol_cell_t )(td->v2.v2);
¶ The cells within `in` and `out` touched by the task are uniquely identified by its `v3` due to the ‘1’ argument we pass to `RIGEL_LOOP()` below.	const int tasknum = td->v3.v3; const int tasks_per_row = C / BLOCK_SIZE_C; rmin = ((tasknum / tasks_per_row) BLOCK_SIZE_R) + 1; //`+ 1` due to 0-padding rmax = rmin + BLOCK_SIZE_R; cmin = ((tasknum % tasks_per_row) BLOCK_SIZE_C) + 1; //`+ 1` due to 0-padding cmax = cmin + BLOCK_SIZE_C; }
¶ descriptor_buffer_fancy.h implements a circular queue of task descriptors. The descriptor buffer will automatically process task descriptors as it fills up, and will process all task descriptors when you tell it to at the end of an interval.	#include "../../benchmarks/common/descriptor_buffer_fancy.h"
¶ We need to writeback all the values we wrote into the `out` array during a task, so that tasks next interval will read the new values	inline void writeback_descriptor(rtm_range_t td) { gol_cell_t in, *out; int rmin, rmax, cmin, cmax; get_task_parameters(td, &in, &out, &rmin, &rmax, &cmin, &cmax); for(int i = rmin; i < rmax; i++) for(int j = cmin; j < cmax; j += (__CACHE_LINE_SIZE_BYTES / sizeof(gol_cell_t))) RigelWritebackLine(&out[IDX(i, j)]); }
¶ We need to invalidate all the input values we read during a task, so that next time we can read updated values that may have been written by another thread.	inline void invalidate_descriptor(rtm_range_t td) { gol_cell_t in, *out; int rmin, rmax, cmin, cmax; get_task_parameters(td, &in, &out, &rmin, &rmax, &cmin, &cmax);
¶ The input area we read goes from (rmin-1, cmin-1) to (rmax+1, cmax+1)	for(int i = rmin-1; i < rmax+1; i++) for(int j = cmin-1; j < cmax+1; j += (__CACHE_LINE_SIZE_BYTES / sizeof(gol_cell_t))) RigelInvalidateLine(&in[IDX(i, j)]); }
¶ Task Code
¶ Simulate one iteration of the Game of Life for a rectangular block of cells. The rules are: If a cell is live and has less than 2 live neighbors, it dies due to loneliness. If a cell is live and has more than 3 live neighbors, it dies due to overcrowding. If a cell is live and has 2 or 3 live neighbors, it lives. If a cell is dead, it stays dead unless it has exactly 3 live neighbors.	void GOLTask(rtm_range_t td) { gol_cell_t in, *out; int rmin, rmax, cmin, cmax; get_task_parameters(td, &in, &out, &rmin, &rmax, &cmin, &cmax); for(int i = rmin; i < rmax; i++) { for(int j = cmin; j < cmax; j++) { bool live = (in[IDX(i, j)] != 0); int numLiveNeighbors = (in[IDX(i-1, j-1)] != 0) + (in[IDX(i-1, j)] != 0) + (in[IDX(i-1, j+1)] != 0) + (in[IDX(i, j-1)] != 0) + (in[IDX(i, j+1)] != 0) + (in[IDX(i+1, j-1)] != 0) + (in[IDX(i+1, j)] != 0) + (in[IDX(i+1, j+1)] != 0); if((live && (numLiveNeighbors == 2 \|\| numLiveNeighbors == 3)) \|\| (!live && numLiveNeighbors == 3)) out[IDX(i, j)] = 1; else out[IDX(i, j)] = 0; } } }
¶ Worker Thread
¶ Every thread runs this function after thread 0 completes initialization.	void GOLWorkerThread() { TQ_TaskDesc tdesc; TQ_RetType ret;
¶ Local `WORLD_IN`/`WORLD_OUT` pointers we can swap between iterations	gol_cell_t in = WORLD_IN, out = WORLD_OUT;
¶ Handle task descriptors for lazy writeback	descriptor_buffer db; unsigned int dbflags = 0;
¶ Initialize flags for descriptor buffer	if(TCMM_WRB_LAZY) dbflags \|= DESCRIPTOR_BUFFER_HANDLES_WRITEBACK; if(TCMM_INV_LAZY) dbflags \|= DESCRIPTOR_BUFFER_HANDLES_INVALIDATE; init_descriptor_buffer(&db, dbflags, writeback_descriptor, invalidate_descriptor); for(int i = 0; i < NUM_ITERATIONS; i++) {
¶ `RIGEL_LOOP()` is a helper function that enqueues a bunch of tasks in parallel across the entire chip. It only works correctly when all threads call it, so if that doesn’t hold in your use case, use `SW_TaskQueue_Enqueue()` instead.
¶ `RIGEL_LOOP()` enqueues a “one-dimensional” block of tasks with identical `v1` and `v2` and varying `v3` and `v4`. We pass in a task descriptor where we can set `v1` and `v2` arbitrarily; these values will be propagated to all enqueued tasks. In this case, it’s helpful to encode the current `in` and `out` pointers in the task descriptor to let the descriptor buffer know what data to flush.	TQ_TaskDesc td;
¶ FIXME Make this process TBAA-clean by defining a `union` of `gol_cell_t` and `unsigned int`	td.v1 = (unsigned int)in; td.v2 = (unsigned int)out;
¶ `RIGEL_LOOP()` could fail, for example, if the task queue overflows.	if((ret = RIGEL_LOOP(QID, (RC)/(BLOCK_SIZE_RBLOCK_SIZE_C), 1, &td)) != TQ_RET_SUCCESS) { fprintf(stderr, "Iteration %d, enqueue returned %d\n", i, ret);
¶ This call lowers to an `abort` instruction, which immediately halts the simulation.	RigelAbort(); }
¶ Dequeue and execute tasks	while (1) { ret = RIGEL_DEQUEUE(QID, &tdesc);
¶ Task queue is empty for now, try again	if (ret == TQ_RET_BLOCK) continue;
¶ Task queue is empty and we reached a barrier, go to the next interval	if (ret == TQ_RET_SYNC) break;
¶ Failure	if (ret != TQ_RET_SUCCESS) { fprintf(stderr, "Iteration %d, dequeue returned %d\n", i, ret); RigelAbort(); }
¶ Dispatch the task
¶ We have 128 bits of state to play with; for more complicated code with many task types in a single interval, v1 could be a function pointer, or an index into a table of task function pointers.	GOLTask((rtm_range_t )(&tdesc)); if (TCMM_WRB_EAGER) { writeback_descriptor((rtm_range_t ) (&tdesc)); } if (TCMM_INV_EAGER) { invalidate_descriptor((rtm_range_t ) (&tdesc)); } if (TCMM_WRB_LAZY \|\| TCMM_INV_LAZY) { add_descriptor(&db, (rtm_range_t )(&tdesc)); } }
¶ End of interval	if (TCMM_WRB_LAZY \|\| TCMM_INV_LAZY) { empty_all_descriptors(&db); }
¶ Print board if DEBUG is defined	#ifdef DEBUG if(RigelGetThreadNum() == 0) { for(int i = 0; i < R; i++) { for(int j = 0; j < C; j++) { fprintf(stdout, "%01d", out[IDX(i+1, j+1)]); } fprintf(stdout,"\n"); } fprintf(stdout,"----------------\n"); } #endif
¶ Swap in and out pointers	gol_cell_t * const temp = in; in = out; out = temp; } //End of all intervals }
¶ Main
¶ All threads will start here once they’ve finished `libc` initialization.	int main(int argc,char *argv[]) {
¶ Get this thread’s global ID (between 0 and `RigelGetNumThreads()`).	int thread = RigelGetThreadNum();
¶ These will be stack-allocated in all threads, but will only be used by thread 0 to free `WORLD_IN` and `WORLD_OUT` at the end.	gol_cell_t world_in_unaligned, world_out_unaligned;
¶ Initialization	if(thread == 0) {
¶ Put all other threads to sleep so we don’t simulate their spin-waiting and save time. This is purely a simulation time optimization, it is not necessary and does not impact program semantics.	SIM_SLEEP_ON();
¶ Argument Parsing	assert((argc == 7 \|\| argc == 8) && "game_of_life TCMM_STRATEGY R C BLOCK_SIZE_R BLOCK_SIZE_C ITERATIONS [RANDOM_SEED]"); int tcmm_strategy = (int) strtol(argv[1], NULL, 10); switch(tcmm_strategy) {
¶ Lazy Invalidate, Lazy Writeback (LILW)	case 0: TCMM_INV_LAZY = 1; TCMM_WRB_LAZY = 1; break;
¶ Lazy Invalidate, Eager Writeback (LIEW)	case 1: TCMM_INV_LAZY = 1; TCMM_WRB_EAGER = 1; break;
¶ Eager Invalidate, Lazy Writeback (EILW)	case 2: TCMM_INV_EAGER = 1; TCMM_WRB_LAZY = 1; break;
¶ Eager Invalidate, Eager Writeback (EIEW)	case 3: TCMM_INV_EAGER = 1; TCMM_WRB_EAGER = 1; break;
¶ No software coherence actions performed, simulates ideal hardware coherence.	case 4: break;
¶ Not allowed	default: assert(0 && "TCMM_STRATEGY not in 0-4"); break; } R = (int) strtol(argv[2], NULL, 10); C = (int) strtol(argv[3], NULL, 10); BLOCK_SIZE_R = (int) strtol(argv[4], NULL, 10); BLOCK_SIZE_C = (int) strtol(argv[5], NULL, 10);
¶ For simplicity, make sure R and C are evenly divisible by the block size to avoid edge case code.	assert(((R % BLOCK_SIZE_R) == 0) && "R must be evenly divisible by BLOCK_SIZE_R"); assert(((C % BLOCK_SIZE_C) == 0) && "C must be evenly divisible by BLOCK_SIZE_C"); NUM_ITERATIONS = (int) strtol(argv[6], NULL, 10); if(argc == 8) { int seed = (int) strtol(argv[7], NULL, 10);
¶ Seed the simulator’s RNG with these 32 bits of state replicated twice.	RigelSRand(seed, seed); } else //Seed with 0 RigelSRand(0, 0);
¶ RTM Structures
¶ Initialize task queue	RIGEL_INIT(QID, MAX_TASKS);
¶ Store all tasks in cluster-local task queues rather than pushing to a global task queue. This decreases contention and increases performance if task length is uniform, otherwise the decreased contention is offset by load imbalance.	SW_TaskQueue_Set_DATA_PARALLEL_MODE();
¶ Game of Life Data
¶ `malloc()` `WORLD_IN` and `WORLD_OUT` arrays and 2^11-byte-align them. The large alignment is for better DRAM row locality. Save the unaligned pointers so we can free them later. We allocate an (R+2)x(C+2) array to allow for zero-padding around the edges.	ALIGNED_MALLOC((R+2)(C+2), gol_cell_t, 11, world_in_unaligned, WORLD_IN); ALIGNED_MALLOC((R+2)(C+2), gol_cell_t, 11, world_out_unaligned, WORLD_OUT);
¶ Initialize `WORLD_IN` randomly.	for(int i = 0; i < R; i++) { for(int j = 0; j < C; j++) {
¶ Initialize cell (i+1, j+1) with a random integer between 0 and 1 generated by the simulator. This lowers to a `syscall` instruction, which lets the simulator generate a random number for us in 0 cycles.	WORLD_IN[IDX(i+1, j+1)] = RigelRandInt(0, 1); } }
¶ Zero-initialize edges of `WORLD_IN` and `WORLD_OUT`.	for(int i = 0; i < R; i++) {
¶ Left edges	WORLD_IN[IDX(i, 0)] = 0; WORLD_OUT[IDX(i, 0)] = 0;
¶ Right edges	WORLD_IN[IDX(i, C+1)] = 0; WORLD_OUT[IDX(i, C+1)] = 0; } for(int j = 0; j < C; j++) {
¶ Top edges	WORLD_IN[IDX(0, j)] = 0; WORLD_OUT[IDX(0, j)] = 0;
¶ Bottom edges	WORLD_IN[IDX(R+1, j)] = 0; WORLD_OUT[IDX(R+1, j)] = 0; }
¶ TCMM Cache Management
¶ Flush (writeback+invalidate in L1/L2) all data touched by thread 0. This has two purposes: In a real Rigel implementation with incoherent L2s, this would be required for threads in other clusters to read the written values. In a simulation environment, where a program would still work if these cache operations were omitted, our performance would be too good because cluster 0’s caches would start out super-hot.	RigelFlushLine(&R); RigelFlushLine(&C); RigelFlushLine(&BLOCK_SIZE_R); RigelFlushLine(&BLOCK_SIZE_C); RigelFlushLine(&NUM_ITERATIONS);
¶ Strictly speaking, we only need to flush the borders of `WORLD_OUT`, but we flush the whole array to make the code easier to read.	gol_cell_t inwalker = WORLD_IN, inend = WORLD_IN + ((R+2)(C+2)); gol_cell_t outwalker = WORLD_OUT; while(inwalker < inend) { RigelFlushLine(inwalker); RigelFlushLine(outwalker); inwalker += (__CACHE_LINE_SIZE_BYTES / sizeof(inwalker)); outwalker += (__CACHE_LINE_SIZE_BYTES / sizeof(outwalker)); } RigelFlushLine(&WORLD_IN); RigelFlushLine(&WORLD_OUT); RigelFlushLine(&TCMM_INV_LAZY); RigelFlushLine(&TCMM_INV_EAGER); RigelFlushLine(&TCMM_WRB_LAZY); RigelFlushLine(&TCMM_WRB_EAGER);
¶ Reset timer 0. Timer 0 will be used to time the parallel section, and will be output at the end of the simulation (look for `TIMER_0000`).	ClearTimer(0);
¶ Wake up all the other threads	SIM_SLEEP_OFF();
¶ Start timer 0.	StartTimer(0);
¶ Set the zero-initialized flag to 1, signalling the other threads to start the worker thread.)	atomic_flag_set(&flag); } else { atomic_flag_spin_until_set(&flag); }
¶ Parallel Section
¶ All threads, including 0, call the worker thread. The code in this function could be inline instead, but the worker thread is a useful structural idiom as it separates out the parallel section.	GOLWorkerThread();
¶ Cleanup
¶ Thread 0 should stop the timer and free global dynamically-allocated memory.	if(thread == 0) { StopTimer(0); free(WORLD_IN); free(WORLD_OUT); } return 0; }

rtm_example.c

Includes

Front Matter

Global Data

Task-Centric Memory Model Handling

Task Code

Worker Thread

Main

Initialization

Argument Parsing

RTM Structures

Game of Life Data

TCMM Cache Management

Parallel Section

Cleanup