Here are my notes for this morning's lecture...

JC

MasPar MP-1

topics:
    system overview
    communication structures
    programming model
    example

MasPar MP-1

    3 main parts: FE, ACU, DPU

    FE: standard DECstation
        users log in, use Unix environment to edit, compile programs, etc

    ACU: "array control unit"
        a RISC CPU with own memory (128KB)
        attached to FE as I/O device on the main system bus
        use: generates control signals for DPU

    DPU: "data parallel unit"
        from 1,024 to 16,384 PEs, arranged in a 2D grid
        1 PE: 4-bit ALU, 40 32-bit registers, up to 64KB local memory
        active bit; defines "active set"

Communication

    X-net
        high level view: square (or rectangular, 2x1) grid with 8-way
            nearest neighbor connections
        toroidal wrap-around connections
        actual construction:

            o   o
             \ /
              X    <-- bidirectional switch; can route
             / \       info from any node to any other node
            o   o

        allows a PE to communicate directly with 8 neighbors using only
            4 direct connections (recall that all PEs route info in the
            same direction at the same time)

    Router
        allows point-to-point communications between any pair of PEs
        side view:

            =====  <------>  =====        cross-bar switches
            / | \            / | \
            ___________________________________   PE array

        cross-bar switch: square, allows any of N inputs to connect
            to any of N outputs
        on MasPar, they're both 32x32 (magic number -- the number of
            PE chips)
        so router allows 32 PEs to communicate with any other 32 PEs in
            one cycle, as long as they are on different chips
        firmware resolves conflicts; when all 1024 PEs want to send and
            receive, it may take several cycles for the traffic to die down

Programming

    MPL -- MasPar Programming Language
        based on ANSI C
        users write one program, which is executed by FE and ACU
        parallel subroutines execute on ACU; special calling sequence
            sets up call to routine running on ACU (which can call procs
            on FE, too); calls can be synchronous or asynchronous

    2 data spaces: one for vars allocated on ACU, other for vars on DPU
        vars on ACU are "singular", vars in PE memories are "plural"
        use a C-style keyword ("plural"), placed like "register" or
            "static", to define plural vars
        * compiler detects when an operation applies to a plural var,
          and generates DPU code for these operations

    example:
        int a;
        plural int b;

        a = 0;
        b = 0;
        b = a+1;
        a = b;            /* ?? */
        a = proc[i].b;

    expressions:
        a+1    singular
        b+1    plural
        a+b    plural -- broadcast a to all PEs

    conditional operations:
        if (a > b) {
            ..    /* PEs with local a > b do this statement */
        }
        else {
            ..    /* other PEs do this one... */
        }

    control statements:
        if (a) ...
        if (b) ...
        if (b) p(b); else q(b);
        while (b) ...
        for (a = 0; a < b; a++) ...

Parallel constructs:
    nproc, nxproc, nyproc    (global singular vars)
    iproc, ixproc, iyproc    (global plural vars)
    globalor(b)              plural b; singular result

Reduction
    a = reduceAdd(b)
        sum of b's on all active processors; singular result
        (actually reduceAdd32, reduceAddd, etc)
        (also: Mul, And, other "binary associative")
        explain how it works: log n time

    x = scanAdd(y,barrier)
        plural x, y, barrier
        "parallel prefix" operator
        x_i = sum(j = 0..i-1) y_j

Communication:
    a = xnetNW[2].b;    /* get b from neighbor 2 hops to the NW */
    a = router[i].b;    /* if i is a plural var, each PE can connect
                           to a different source.... */

    blockIn(fp,dp,x0,y0,xn,yn,size) -- copy from a vector on the FE,
        which has address 'fp', to a block of PEs. 'dp' is the
        local (plural) address of the PE data, 'size' is the amount
        of data per PE, and the "coordinates" of the PE block are
        x0,y0,xn,yn
        example: random number seeds

Manuals:
    /usr/maspar/doc/*.ps
    print them, view them, whatever
    * log in to "beauty"

MPPE -- MasPar Programming Environment

In-class exercises --
    inner product of x and y (4096x1)
    matrix multiplication of a and b (both 64x64)