i like Parrot's custom bytecodes
---
load/load, load/store, store/load, store/store
"Sparc V8 has a “membar” instruction that takes a 4-element bit vector. The four categories of barrier can be specified individually"
also http://denovo.cs.illinois.edu/Pubs/10-cacm-memory-models.pdf wants a way to specify a barrier that only applies to labeled 'sync operations'
so varieties of memory barriers:
{load, store, *}/{load, store, *}; as sync or as data; and relative to only sync ops or only data ops. i guess each load/store must then mark whether it is sync or data. and i guess each barrier should also say whether mmio (memory-mapped io) should be affected ( http://lwn.net/Articles/283776/ ), or whether ONLY mmio should be affected. so for example a memory barrier might say "load-sync/*-{sync+data}, cpu+mmio", meaning "don't move any loads or stores (sync or data) that follow this instruction before any sync loads that precede it, for both cpu and mmio accesses"
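the variety-of-barriers idea above could be encoded as a bitmask; a minimal sketch in C, with made-up flag names (nothing here corresponds to a real ISA):

```c
#include <stdint.h>

/* hypothetical encoding of the barrier varieties above as a bitmask;
   all names here are invented for illustration */
enum {
    BAR_BEFORE_LOAD  = 1 << 0,  /* orders loads that precede the barrier */
    BAR_BEFORE_STORE = 1 << 1,  /* orders stores that precede the barrier */
    BAR_AFTER_LOAD   = 1 << 2,  /* orders loads that follow the barrier */
    BAR_AFTER_STORE  = 1 << 3,  /* orders stores that follow the barrier */
    BAR_ONLY_SYNC    = 1 << 4,  /* restricts the 'before' side to accesses marked sync */
    BAR_ONLY_DATA    = 1 << 5,  /* restricts the 'before' side to accesses marked data */
    BAR_CPU          = 1 << 6,  /* applies to ordinary cacheable memory */
    BAR_MMIO         = 1 << 7,  /* applies to memory-mapped i/o */
};

/* the example barrier from the text: "load-sync/*-{sync+data}, cpu+mmio" */
static inline uint32_t example_barrier(void) {
    return BAR_BEFORE_LOAD | BAR_ONLY_SYNC
         | BAR_AFTER_LOAD | BAR_AFTER_STORE
         | BAR_CPU | BAR_MMIO;
}
```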
isync -- complete all previous instructions up to what is needed before you perform a context switch
incoherent instruction cache flush for self-modifying code
dependency barrier (causes dependencies to be ordered, only needed on Alpha)
command to wait until a write is 'visible' to all other CPUs, and command to wait until all writes from other CPUs are visible here; vs. commands that do the previous barriers but only w/r/t local reordering. these two 'wait' commands are similar to CRF's Reconcile and Commit, i guess. note that in my formulation, the sender must commit and the receiver must reconcile to ensure transmission; otherwise you only get eventual consistency.
i'll define eventual consistency to say that FIFO consistency may be transiently violated but if you wait long enough and the sender doesn't change the value any more, you'll eventually see their latest value forever
general marker for sections of code dealing with threadlocal vs. shared memory? or is this part of the load/store stuff?
note: acquire/release can be implemented in terms of the above;
" Acquire semantics is a property which can only apply to operations which read from shared memory, whether they are read-modify-write operations or plain loads. The operation is then considered a read-acquire. Acquire semantics prevent memory reordering of the read-acquire with any read or write operation which follows it in program order.
Release semantics is a property which can only apply to operations which write to shared memory, whether they are read-modify-write operations or plain stores. The operation is then considered a write-release. Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order." -- http://preshing.com/20120913/acquire-and-release-semantics/#comment-20810
a release is load/store + store/store, and an acquire is load/load + load/store.
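as a concrete illustration of acquire/release pairing (using C11's standard atomics rather than any particular ISA's barriers), a minimal producer/consumer handoff; the function names are mine:

```c
#include <pthread.h>
#include <stdatomic.h>

static int payload;       /* plain (non-atomic) data */
static atomic_int ready;

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                                            /* plain store */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* write-release */
    return 0;
}

/* returns the payload as observed after the read-acquire; the acquire load
   pairing with the release store guarantees the consumer sees payload == 42 */
int run_handoff(void) {
    payload = 0;
    atomic_store_explicit(&ready, 0, memory_order_relaxed);
    pthread_t t;
    pthread_create(&t, 0, producer, 0);
    while (!atomic_load_explicit(&ready, memory_order_acquire))  /* read-acquire */
        ;
    int seen = payload;
    pthread_join(t, 0);
    return seen;
}
```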
http://fileadmin.cs.lth.se/cs/Education/EDAN25/F06.pdf 56/70 "Example of Cumulative Ordering"; on Power, there is also an option for some barriers to be 'cumulative' and some to not, which just means transitive, e.g. causal transitivity
???
"
And a couple of implicit varieties:
(5) LOCK operations.
This acts as a one-way permeable barrier. It guarantees that all memory operations after the LOCK operation will appear to happen after the LOCK operation with respect to the other components of the system.
Memory operations that occur before a LOCK operation may appear to happen after it completes.
A LOCK operation should almost always be paired with an UNLOCK operation.
(6) UNLOCK operations.
This also acts as a one-way permeable barrier. It guarantees that all memory operations before the UNLOCK operation will appear to happen before the UNLOCK operation with respect to the other components of the system.
Memory operations that occur after an UNLOCK operation may appear to happen before it completes.
LOCK and UNLOCK operations are guaranteed to appear with respect to each other strictly in the order specified.
The use of LOCK and UNLOCK operations generally precludes the need for other sorts of memory barrier (but note the exceptions mentioned in the subsection "MMIO write barrier").
" -- https://www.kernel.org/doc/Documentation/memory-barriers.txt
"
(*) There is no guarantee that a CPU will see the correct order of effects from a second CPU's accesses, even _if_ the second CPU uses a memory barrier, unless the first CPU _also_ uses a matching memory barrier (see the subsection on "SMP Barrier Pairing")." -- https://www.kernel.org/doc/Documentation/memory-barriers.txt
--
some interesting ideas from various weird x86 features:
https://en.wikipedia.org/wiki/Control_register#CR3
---
x86 has a 'REP' (repeat) prefix for some instructions: http://faydoc.tripod.com/cpu/movsb.htm , which i guess is like a one-instruction for loop
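roughly, REP MOVSB behaves like a memcpy loop; a sketch of its semantics in C (direction flag assumed clear, and no modeling of partial progress on interrupts):

```c
#include <stddef.h>
#include <stdint.h>

/* rough C model of REP MOVSB: RCX is the repeat count, RSI/RDI the
   source/destination pointers; each iteration copies one byte and
   (with the direction flag clear) advances both pointers */
void rep_movsb(uint8_t *rdi, const uint8_t *rsi, size_t rcx) {
    while (rcx--)
        *rdi++ = *rsi++;
}
```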
--
http://www.cc.gatech.edu/~rama/Beehive/papers/git.cc.91.51.pdf table 1: hardware primitives (paraphrased):
read without coherence
write without coherence
read-global: read data from shared memory, bypassing cache
write-global: globally perform the write
read-update: read data from shared memory and request future updates
reset-update: cancel the request for updates from shared memory
flush-buffer: stall this processor until all requests in the write-buffer are globally performed
read-lock: request a shared lock for a memory block
write-lock: request an exclusive lock for a memory block
unlock: release the lock
---
" Popek and Goldberg summarized the concept in 1974: "Formal Requirements for Virtualizable Third Generation Architectures". Communications of the ACM 17 Equivalence/Fidelity A program running under the hypervisor should exhibit a behaviour essentially identical to that demonstrated when running on an equivalent machine directly. Resource control / Safety The hypervisor should be in complete control of the virtualized resources. Efficiency/Performance A statistically dominant fraction of machine instructions must be executed without hypervisor intervention. " -- ARMv7-A Architecture Overview, David Brash, Architecture Program Manager, ARM Ltd. www.linaro.org/documents/download/d7fe510d8eb46775afc3953d217b15224fbb93086598a (good read, slides with various details about ARM virtualization)
notes:
--
See also jasperLowEndTargets
--
" Synchronization Instructions 1990-1993-2005-...
atomic RMW:
compare-and-swap (CAS), e.g. x86 CMPXCHG
x86 LOCK prefix
fetch-and-add, e.g. x86 XADD
double-width compare-and-swap (DCAS), e.g. x86 CMPXCHG8B, CMPXCHG16B
"RISC" load-linked / store-conditional (LL/SC)
MONITOR/MWAIT
Advanced topics:
multi-location atomicity:
compare-and-swap 2 locations (CAS2) - AFAIK not implemented by anyone
KCSS
LLn/SCn
transactional memory (TM) " -- https://www.semipublic.comp-arch.net/wiki/ISA_Trends,_1980-2010
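to illustrate how the primitives above relate, a sketch of fetch-and-add built from a compare-and-swap retry loop, using C11 atomics; the function name is mine:

```c
#include <stdatomic.h>

/* fetch-and-add built from compare-and-swap: the classic reduction.
   retry until no other thread changed the value between our load and
   our CAS; returns the value observed before the add */
int fetch_add_via_cas(atomic_int *p, int delta) {
    int old = atomic_load(p);
    while (!atomic_compare_exchange_weak(p, &old, old + delta))
        ;  /* on failure, 'old' was reloaded with the current value; retry */
    return old;
}
```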
" Generic Fence Options
Many different types of fences can be expressed as combinations of the above features.
Do not perform or make visible subsequent instructions of type:
All memory operations / Loads / Stores / Fences / Instruction Fetches / Any specially marked versions of the above
Until all earlier instructions of type:
All memory operations / Loads / Stores / Fences / Instruction Fetches / Any specially marked versions of the above
Have been completed
Further options:
In particular address ranges or types of memory
Make visible / flush to certain points - e.g. flush into L2, but not further
Split Fences
TBD: my own work from the 80s
Recent 2009 papers
Ordering vs. Reliability Fences
There are at least two different usage models for fences - or the related flush instructions.
Ordering Fences Folks who want to obtain semantic guarantees about memory ordering, and hence parallel program correctness. In general, these folks don't care if a fence instruction flushes a queue of memory operations, so long as it appears to do so.
Reliability Fences Folks who, e.g., want to ensure that all pending stores in store buffers have been flushed to battery backed up RAM. Or ensure that there are no loads in flight when they are trying to un-hotplug a module.
Strictly speaking the latter are flush operations, not fence operations. But the terms flush and fence are often mixed up, since the earliest fence operations were often implemented by flushing buffers. " -- https://www.semipublic.comp-arch.net/wiki/Fence_Instructions
"
Basic vector instructions
vector-vector instructions
Less common
scalar vector: vreg[i] += scalar, for all i " -- https://www.semipublic.comp-arch.net/wiki/ISA_Trends,_1980-2010
" DSP ISA extensions
Branchless code instructions
conditional move (CMOV), conditional select (CSELECT)
conditional load, conditional store - do you fault if the condition/predicate is false?
predication
Specific instructions - in integer, scalar FP, packed, etc. versions
absolute value (ABS) maximum (MAX), minimum (MIN) minimum and maximum (MINMAX) sort instruction
" -- https://www.semipublic.comp-arch.net/wiki/ISA_Trends,_1980-2010
" Security and DRM ISA Extensions 2000-2005
Microsoft [(Next Generation Secure Computing Base)] inspired hardware support efforts at CPU manufacturers:
Intel Trusted Execution Technology (TXT or TET), (originally Lagrande Technology (LT)). SINIT instruction
AMD: TBD?
Similar features were found on non-Wintel processors, such as ARM and MIPS, all seeking to support DRM.
In some ways beaten to product line by virtual machine (VM) support. " --https://www.semipublic.comp-arch.net/wiki/ISA_Trends,_1980-2010
" Random number generator (RNG) ISA support Virtual Machines renaissance 2002-... Cryptography ISA extensions 2008-... " -- https://www.semipublic.comp-arch.net/wiki/ISA_Trends,_1980-2010
"
The Intel AESNI instructions look like multiply or divide-step instructions
AESENC Perform one round of an AES encryption flow. AESENCLAST Perform the last round of an AES encryption flow. AESDEC Perform one round of an AES decryption flow. AESDECLAST Perform the last round of an AES decryption flow. AESKEYGENASSIST Assist in AES round key generation. AESIMC Assist in AES Inverse Mix Columns.
And, as of 2010, multiply-step and divide-step instructions look as if they were a dead end, a solution that did not last long. I suspect that memory-to-memory engines may be the ultimate direction for cryptographic instructions. APIs such as
ENCRYPT <parameter block>,<state block>,<from, to address ranges> DECRYPT <parameter block>,<state block>,<from, to address ranges> HASH <parameter block>,<state block>,<from, to address ranges> " -- https://www.semipublic.comp-arch.net/wiki/ISA_Trends,_1980-2010
" Parsing instructions
See SSE4.2 String / Text / XML instructions: PCMP(ie)STR(im)
Motion Estimation
MPSAD (multiple packed sum of absolute difference)
vector minimum entry and index
GPU ISA features "
pcmpstr: http://halobates.de/pcmpstr-js/pcmp.html
http://halobates.de/blog/p/207
https://www.semipublic.comp-arch.net/wiki/X86_ISA#SSE4.2_String_.2F_Text_.2F_XML_instructions
Vector Conditional Moves
SSE4.1's BLEND instructions are misnamed - really are vector select under mask.
AMD's XOP has VPCMOV Vector Conditional Move and VPPERM, vector packed permute bytes.
VPCMOV is a bit level select: three operands, the select mask, and the 2 inputs. Can implement byte or element select by appropriate extension from a bitmask to an element mask, 0/-1.
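a scalar sketch of the bit-level select that VPCMOV performs, plus the bitmask-to-element-mask extension mentioned above; names are mine:

```c
#include <stdint.h>

/* bit-level select, the operation VPCMOV performs across a whole vector:
   take each bit from a where the mask bit is 1, else from b */
uint64_t bitsel(uint64_t mask, uint64_t a, uint64_t b) {
    return (a & mask) | (b & ~mask);
}

/* extend a per-byte condition bit to a 0x00/0xFF element mask, turning
   the bit select into a byte select, as the text describes */
uint64_t byte_mask_from_bits(uint8_t cond_bits) {
    uint64_t m = 0;
    for (int i = 0; i < 8; i++)
        if (cond_bits & (1u << i))
            m |= (uint64_t)0xFF << (8 * i);
    return m;
}
```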
VPPERM is more than an element-wise conditional move - it is a vector permute, on steroids.
Permutation
Permutation instructions are a perennial bugbear.
Hard to get a fully general permute. Many special cases:
XOP
VPSHUFHW, VPSHUFD, VPSHUFLW, VPSHUFW permute VPERMIL2PS, VPERMIL2PD
Dot Product
https://www.semipublic.comp-arch.net/wiki/X86_ISA
Prefetch and Hint Instructions
Packed Vector Shuffles and Permutes
Packed vector Permute instructions have a long and annoying history.
Arguably, instruction sets such as Altivec jumped to the ideal: an arbitrary byte oriented permute instruction. However
what about 4 bit and 2 bit numbers - especially useful for genomics
where does the permutation vector live? in a register - inconvenient; in constants - funny-sized constant
what about the hardware cost?
Equivalently, scatter/gather - whether register to memory, or register to register.
Considerations such as these have motivated a long list of special case permutation instructions.
SSE: SHUFPS, UNPCKHPS, UNPCKLPS, PSHUFW SSE3: MOVDDUP, MOVSHDUP, MOVSLDUP
System, Virtual Machine & Security Instructions
It would take too long to explicitly describe these, so I will just list them as described in wikipedia, http://en.wikipedia.org/wiki/X86_instruction_listings
Added with AMD-V CLGI, SKINIT, STGI, VMLOAD, VMMCALL, VMRUN, VMSAVE (SVM instructions of AMD-V)
Added with Intel VT-x VMPTRLD, VMPTRST, VMCLEAR, VMREAD, VMWRITE, VMCALL, VMLAUNCH, VMRESUME, VMXOFF, VMXON
Other Timer extensions
There is a perennial problem between absolute and process- or thread- or whatever virtual time - excluding interrupts, etc.
I (Glew) have repeatedly proposed instructions that read the timer and, atomically, add in one of several possible offsets from registers or structs, thereby providing multiple virtual times. Possibly scaled. Along with instructions to manipulate such virtual times.
VPPERM
AMD XOP's VPPERM is more than an element wise conditional move - it is a vector permute, on steroids.
VPPERM is a permutation operator, with additional features
concatenates two vector operands, like a funnel shift
byte permutation: 5 bit selection, 32B, 256b (2 concatenated 128b)
extra per-byte operation:
select source byte
invert source byte
bit reverse source byte
bit reverse inverted source byte
00h
FFh
most significant bit replicated
inverted most significant bit replicated
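a scalar model of a small subset of this (the 5-bit source select over two concatenated 16-byte operands, plus the copy and invert per-byte operations); the op encodings here are assumptions for illustration, not AMD's actual ones:

```c
#include <stdint.h>

/* scalar sketch of a VPPERM-style permute over two 16-byte sources;
   only the 'copy' and 'invert' per-byte operations are modeled, and
   the op encodings (0 = copy, 1 = invert) are made up */
void vpperm_sketch(uint8_t dst[16], const uint8_t a[16],
                   const uint8_t b[16], const uint8_t sel[16]) {
    for (int i = 0; i < 16; i++) {
        uint8_t s   = sel[i];
        /* low 5 bits select one of 32 bytes across the concatenated a:b */
        uint8_t src = (s & 0x1F) < 16 ? a[s & 0x0F] : b[s & 0x0F];
        /* high 3 bits select the extra per-byte operation */
        uint8_t op  = s >> 5;
        dst[i] = (op == 1) ? (uint8_t)~src : src;
    }
}
```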
This just fits in 128 bits; might not extend to 256, 512 bits.
VPERMIL2PS or PD
Floating point, not quite as powerful as VPPERM, but still pretty major.
Selects from one of two source operands, or 0, depending on a few bits of a 3rd selection operand, and an immediate.
Can be considered a special case of the general hardware of VPPERM, convenient to comparisons produced by VPCOMW, etc., instructions.
Vector Shifts
AMD XOP and Intel AVX contain vector element shifts - where the size is specified by the opcode, and the shiftcounts are specified either by a vector, with a different shiftcount per element, or else by a constant or scalar, with the same shiftcount per element.
TBD: make the size dynamically specifiable? Restrict to power of two, or make any integer number of bits?
Elsewhere I have discussed wide integer shifts, treating a packed vector register as a very long integer.
Vector element funnel shifts occur to me as a natural step: Take two vectors, concatenate on an element basis, and shift by a shift count e.g. scalar or in a corresponding vector register.
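the element funnel shift proposed above, sketched for one 32-bit element in C; the function name and the count == 0 edge-case handling are mine:

```c
#include <stdint.h>

/* element funnel shift: conceptually concatenate hi:lo into a 64-bit
   value and take the 32-bit window starting 'count' bits into it from
   the top; count must be in [0, 31] */
uint32_t funnel_shift_left(uint32_t hi, uint32_t lo, unsigned count) {
    if (count == 0)
        return hi;  /* avoid the undefined shift by 32 below */
    return (hi << count) | (lo >> (32 - count));
}
```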
Instruction Prefixes
Not so much an instruction operation feature, as an extension feature.
x86 has had instruction prefixes since its beginning
LOCK, segment prefixes, REP/REPNE
AMD-64 introduced the REX prefix
REX
XOP introduces a 3 byte prefix
3 byte 8F RXB.m5 Wv4Lpp
AVX (and LRBNI) introduce the VEX prefix
VEX
Along the way, the C4 prefix was used
C4
FMA
What: FMA3 / FMA4
Carry-less Divide
What: Carry-free division (?Glewism)
What: Reduction of an Integer Modulo a Polynomial
Where: not yet (except CRC)
When: future, if at all (likelihood high?)
Computations such as GCM in AES and elliptic curve cryptography frequently require the reduction of an integer, interpreted as a polynomial (SUM of x^i AND val_i), modulo another polynomial, in the Galois field, using XOR.
Currently (2010) this is implemented using shifts and shuffles without hardware support - with the exception of the CRC instruction, which has a hardwired polynomial.
Glew conjecture: hardware support for this should be forthcoming.
Carry-less Multiply
What: Carryless Multiply (CLMUL) Where: Intel x86 nhm When: Proposed 2008 / shipped in Nhm 2010
Part of AESNI, although I break it out as more generic. Used in the Galois Counter Mode (GCM) of AES.
GCM requires carry-less multiplication of 128 bit operands, producing a 255-bit result.
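carry-less multiplication is ordinary shift-and-add long multiplication with XOR in place of ADD, so no carries propagate; a sketch for 32-bit operands (a real CLMUL such as x86 PCLMULQDQ does the same thing on 64-bit operands):

```c
#include <stdint.h>

/* carry-less (polynomial) multiply of two 32-bit values: for each set
   bit of a, XOR in a shifted copy of b; the result is the product of
   the operands viewed as polynomials over GF(2) */
uint64_t clmul32(uint32_t a, uint32_t b) {
    uint64_t acc = 0;
    for (int i = 0; i < 32; i++)
        if (a & (1u << i))
            acc ^= (uint64_t)b << i;
    return acc;
}
```

for example, clmul32(3, 3) is 5, because (x+1)^2 = x^2+1 over GF(2).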
Vector-to-vector element-by-element shifts
256 bit AVX vectors
left shift logical (dword/qword) right shift logical (dword, qword) right shift arithmetic (dword)
Curiously, the qword (64b) version of vector element by element right shift arithmetic is missing. Why? Cost?
Floating Point Multiply Accumulate
Bit manipulation
ANDN - And Not: Z := ~X & Y
BLSR - Reset Lowest Bit: Y := X & (X-1)
BLSMSK - Get Mask Up to Lowest Set Bit: Y := X ^ (X-1)
LZCNT - Leading Zero Count
TZCNT - Trailing Zero Count
BZHI - Zero High Bits starting with Specified Bit Position
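portable C equivalents of a few of the listed operations, as a reference for what the instructions compute (the tzcnt here is the slow loop version; the names are lowercased versions of the mnemonics):

```c
#include <stdint.h>

uint32_t andn(uint32_t x, uint32_t y)  { return ~x & y; }       /* ANDN */
uint32_t blsr(uint32_t x)              { return x & (x - 1); }  /* reset lowest set bit */
uint32_t blsmsk(uint32_t x)            { return x ^ (x - 1); }  /* mask up to lowest set bit */

/* trailing zero count; hardware TZCNT defines the x == 0 case as 32 */
uint32_t tzcnt(uint32_t x) {
    uint32_t n = 0;
    if (x == 0) return 32;
    while (!(x & 1)) { x >>= 1; n++; }
    return n;
}
```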
Gather
All combinations of
Gather datatype: SP FP, DP FP, Dword (32b), Qword (64b) Index datatype: Dword (32b), Qword (64b)
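what gather computes, in scalar form; real hardware gathers add a mask and per-element faulting, which this sketch omits:

```c
#include <stddef.h>
#include <stdint.h>

/* scalar model of a gather: one indexed load per element,
   dst[i] = base[idx[i]] for each of the n elements */
void gather32(uint32_t *dst, const uint32_t *base,
              const uint32_t *idx, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = base[idx[i]];
}
```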
VPERM*, on dword (32b) and qword (64b)
Highlights from the point of view of instruction novelty (not necessarily absolute, but newness to x86):
vadcpi - vector add with carry in and carry out
vaddsetcpi - vector add and set mask to carry
vsbbpi - subtract vectors with borrow in and borrow out
vsbbrpi - reverse subtract with borrow
vaddsetsp - vector add and set mask to sign
vclampzp - clamp vector elements between a value and zero
vcvtps2srgb8 - convert vector to sRGB8 vector
vector scatter/gather - index (address) vectors
vexpand / vcompress - under mask?
bitinterleave: 1:1, 2:1, vector/scalar
bsfi/bsri - bit scan forward / reverse, initialized to start in the middle, after an earlier bsf/bsr
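the 1:1 bitinterleave mentioned above (Morton order), sketched with a simple bit loop; a real instruction, or a magic-constant spreading routine, would be much faster:

```c
#include <stdint.h>

/* 1:1 bit interleave: bit i of a goes to bit 2i of the result,
   bit i of b goes to bit 2i+1 */
uint32_t interleave_1_1(uint16_t a, uint16_t b) {
    uint32_t r = 0;
    for (int i = 0; i < 16; i++) {
        r |= (uint32_t)((a >> i) & 1) << (2 * i);
        r |= (uint32_t)((b >> i) & 1) << (2 * i + 1);
    }
    return r;
}
```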
Some of Larrabee's most powerful features involved operand broadcast and swizzle, and more powerful conversions on memory operands. Full permute was not allowed, but a nice selection of permutes within a 128 bit "lane" of 4 32b elements.
identity: dcba
pair swap: cdab
two away: badc
broadcast: aaaa, bbbb, cccc, dddd
rotate, as in cross product: dacb
AVX, SSE5, and XOP
Between 2007 and 2011 Intel and AMD have been thrashing about packed vector extensions such as Intel AVX, and AMD SSE5 and XOP, with Intel LRBNI in the background.
This wiki page is not intended to be a history of this thrashing. This wiki page is meant only to capture the key innovations or technical aspects, forgetting the marketing spin. These aspects will be separate sections.
Key instruction set packages in AVX, SSE5, XOP (and LRBNI, to some extent):
non-destructive 3-operand A = B op C and 4-operand A = B op1 C op2 D
FMA3 and FMA4 as aspects
permute
conditional move (in packed vector)
wide integer (in packed vector register)
select under mask
vector shift/rotate (with per-element counts)
vector integer compare
vector integer multiply accumulate w/wo saturation and width changes
FP16
XOP
horizontal integer add/subtract - add adjacent pairs of values
FP fraction extract
Vector Conditional Moves
Intel AVX FMA has the following:
double/single scalar/packed fused multiply add/subtract, operand optionally negated - FMADD, FMSUB, FNMADD, FNMSUB
FMADDSUB - fused multiply alternating add/subtract
SSE4.1
What: Intel x86 SSE4.1 Where: Penryn When: 2007?
SSE4.1 with Intel Core 2 (Merom) is notable for
Motion estimation instructions MPSADBW and PHMINPOSUW
Dot product instructions DPPS, DPPD
Streaming load, MOVNTDQA
BLENDPS, etc. instructions - misnomer: really are a conditional move or vector select under bitmask
packed integer min/max, e.g. PMINUB
insert/extract instructions
packed integer format conversions
quadword (128b) equality
hole filling: PACKUSDW
SSE4.2
What: Intel x86 SSE4.2 Where: Nehalem Core i7 When: 2009
SSE4.2
string instructions (compare): explicit vs implicit length, returning index or mask, byte-specifying formats
PCMPGT - 128 bit greater-than compare, filling in, partially, the hole whereby there were 128 bit equality checks but not direction checks
CRC32
POPCNT - just a 1-operand form - no ANDing or XORing
SSE4a
What: AMD x86 SSE4a Where: Barcelona When: 2009
LZCNT - leading zero count
POPCNT
EXTRQ/INSERTQ
MOVNTSD / MOVNTSS - scalar streaming store instructions
SSE
What: Intel SSE Where: Pentium III When: 1999
From wikipedia: http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
SSE introduced both scalar and packed floating point instructions.
Floating point instructions:
Memory-to-Register / Register-to-Memory / Register-to-Register data movement: Scalar - MOVSS; Packed - MOVAPS, MOVUPS, MOVLPS, MOVHPS, MOVLHPS, MOVHLPS
Arithmetic: Scalar - ADDSS, SUBSS, MULSS, DIVSS, RCPSS, SQRTSS, MAXSS, MINSS, RSQRTSS; Packed - ADDPS, SUBPS, MULPS, DIVPS, RCPPS, SQRTPS, MAXPS, MINPS, RSQRTPS
Compare: Scalar - CMPSS, COMISS, UCOMISS; Packed - CMPPS
Data shuffle and unpacking: Packed - SHUFPS, UNPCKHPS, UNPCKLPS
Data-type conversion: Scalar - CVTSI2SS, CVTSS2SI, CVTTSS2SI; Packed - CVTPI2PS, CVTPS2PI, CVTTPS2PI
Bitwise logical operations: Packed - ANDPS, ORPS, XORPS, ANDNPS
Integer instructions:
Arithmetic: PMULHUW, PSADBW, PAVGB, PAVGW, PMAXUB, PMINUB, PMAXSW, PMINSW
Data movement: PEXTRW, PINSRW
Other: PMOVMSKB, PSHUFW
Other instructions:
MXCSR management: LDMXCSR, STMXCSR
Cache and memory management: MOVNTQ, MOVNTPS, MASKMOVQ, PREFETCH0, PREFETCH1, PREFETCH2, PREFETCHNTA, SFENCE
Mostly unremarkable, although I may mention some, such as packed average, in more detail.
Meta-organization
Packages or Families of Instruction set extensions #AESNI #16_bit_FP SSE, SSE2, SSSE3, SSE4, SSE5, AVX, LRBNI, XOP
Cryptography #AESNI #Carry-less_Multiply #Carry-less_Divide
Vectors, Packed #Vector_Shifts #Add-Subtract #Vector_Horizontal_or_Pairwise_Operations Related #FMA #16_bit_FP
Synchronization #Monitor_and_Mwait
System instructions Virtual machines and security, Intel and AMD's different flavors.
" Bounds checking
HardBound?: Architectural Support for Spatial Safety of the C Programming Language. Joe Devietti, Colin Blundell, Milo M. K. Martin, and Steve Zdancewic. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2008.
http://www.cis.upenn.edu/acg/talks/asplos08_hardbound_talk.pdf http://www.cis.upenn.edu/acg/papers/asplos08_hardbound.pdf
Although Milo disavows this heatedly, HardBound? is really an example of a capability instruction set extension. It is really an attempt to make accessible to regular programs some of the features of secure pointers in instruction sets such as that of the IBM AS400. HardBound? differs, however, in not using fat pointers, but in placing the capability descriptors in memory locations separate from the pointers they are securing, permitting unchanged memory layout for data type binary compatibility.
HardBound? requires the compiler to identify where pointers are created, and what they are allowed to point to:
setbound instruction
Hardware checks and propagates the pointers. All registers and memory have metadata, describing, in this case, the bounds of the pointer. "
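a software caricature of the HardBound idea; note that HardBound itself keeps the bounds metadata in disjoint shadow space, invisible to data layout, whereas this illustration simply carries it alongside the pointer in a struct - the struct and function names are mine:

```c
#include <stddef.h>

/* a pointer plus the bounds metadata HardBound would keep in shadow memory */
typedef struct {
    int   *base;
    size_t len;   /* number of elements the pointer may legally reach */
} bounded_ptr;

/* the 'setbound' step: the compiler marks pointer creation, associating
   the new pointer with what it is allowed to point to */
bounded_ptr setbound(int *p, size_t len) {
    bounded_ptr bp = { p, len };
    return bp;
}

/* every dereference is checked against the propagated bounds;
   returns 0 on success, -1 on a spatial-safety violation */
int checked_load(bounded_ptr bp, size_t i, int *out) {
    if (i >= bp.len)
        return -1;
    *out = bp.base[i];
    return 0;
}
```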
https://www.semipublic.comp-arch.net/wiki/Stream_descriptor#Stream_descriptor
"
Stream descriptor
A stream descriptor is a description, in a register or memory data structure, that describes a memory access pattern.
For example, a 1D stream descriptor might include
data item description - size, type
base
stride
length
current position
A 2D stream descriptor might include
data item description - size, type
base
horizontal stride, length
vertical stride, length
current position, row and column
Obviously extendable to N dimensions.
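a 2D stream descriptor and its 'advance' operation as a C sketch - the step a "load from stream" instruction would perform; field and function names are mine:

```c
#include <stddef.h>
#include <stdint.h>

/* a 2D stream descriptor held in a data structure, per the text */
typedef struct {
    const uint8_t *base;
    size_t elem_size;           /* data item description (size only, here) */
    size_t h_stride, h_len;     /* bytes between elements, elements per row */
    size_t v_stride;            /* bytes between rows */
    size_t row, col;            /* current position */
} stream2d;

/* return the address of the current element, then advance row-major */
const uint8_t *stream2d_next(stream2d *s) {
    const uint8_t *p = s->base + s->row * s->v_stride + s->col * s->h_stride;
    if (++s->col == s->h_len) {
        s->col = 0;
        s->row++;
    }
    return p;
}
```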
In addition to arrays, there may be other types of stream descriptor
A scatter/gather descriptor
a 1D descriptor of a vector of addresses, with an indication that each address should be dereferenced
A linked list descriptor
base
a skeleton structure descriptor
payload to be accessed while traversing the list - offset, type
position within a list element of the pointer to the next element
A tree or B-tree descriptor
base
a skeleton structure descriptor
payload
key
pointers
For the case of a tree or B-tree descriptor, the access pattern may (a) traverse all of a tree or subtree or (b) traverse only a path seeking an individual element.
Explicit versus Implicit Stream Descriptors
A stream descriptor can be explicit - e.g. it may be bound to a register, so that a memory access instruction might look like
destreg := LOAD [stream_descriptor_reg]
Issue: what about index and offset addressing modes? Nicest if there are instructions "load/store from/to stream".
"
"
Data Manipulation Overhead
Users of SIMD packed vector instruction sets quickly come to realize that the actual memory access and computation is only a small part of the task of programming code to use such instructions. Often more than half of the instructions in such kernels are data manipulation overhead
unpacking 8 bit numbers into 16 bit numbers, or, worse, even fancier unpackings such as 5:6:5
transposing rows and columns
chunky versus planar data, i.e. SOA versus AOS
Explicit support for streams via stream descriptors can eliminate much of this overhead, avoiding explicit instructions.
The CSI Multimedia Architecture goes further: not only does it hide the data packing and unpacking, but it also seeks to hide the SIMD packed vector nature of the hardware datapath. Instead of the programmer loading from a stream into a register of N bits, operating, and then storing, this paper proposes complex streaming instructions that are implemented N bits at a time in such a way that the code does not need to be changed when the microarchitecture changes the width of N.
Others have attempted to go even further, providing ways of expressing streaming kernels using simple instructions that automatically get mapped onto a fixed length datapath. "
"
The CSI Multimedia Architecture, IEEE T. VLSI, Jan 2005. Stream descriptors, plus optimizations to allow implicit SIMDification.
iWarp - stream descriptors bound to registers
Wm - William Wulf's processor, first instance I know of stream descriptors bound to registers "
" Other security related instruction set extensions
separating the return address stack from data stack "
" More Types of Arithmetic Matrix Arithmetic
vector matrix matrix matrix transpose
Historically held back by register size; now, as we approach 512 bits = 16 32 bit numbers = 4x4 32b matrix ...
Copious possibilities for new instructions...
Complex Arithmetic
Quaternion Arithmetic
??? In for a penny, in for a pound (dropped off a bridge in Dublin) ???
LNS (logarithmic number systems)
In particular, instruction set support for the Gaussian logarithm, which allows computation of A+B in log form, i.e. computing log(A+B) given log(A) and log(B).
LNS often is coupled to irregular LUT support.
Lightweight threading instruction set extensions
Q: is MIPS MT such?
Message passing instruction set extensions
send/receive, put/get
PGAS (partitioned global address space)
See Intel GenX? Instruction Set#Message Gateway, although these messages seem to be used for communication within a GPU, at small scale, not at a wide scale between processors or systems. "
" Bit matrix multiply (BMM)
...
Form of Matrix Multiplication
One of the most natural forms is "conventional" AND-OR BMM:
for(i=0;i<64;i++) {
    for(tmp=0, j=0;j<64;j++) {
        tmp |= a.bit[j] & b.bit[j][i];
    }
    result.bit[i] = tmp;
}
Another much loved form is AND-XOR (also called AND parity).
Yet another form is AND-POPCNT
for(i=0;i<64;i++) {
    for(tmp=0, j=0;j<64;j++) {
        tmp += a.bit[j] & b.bit[j][i];
    }
    result[i] = tmp;
}
although here the result is not 64 bits, but at least 64*6 bits (typically 64*8).
Uses
The AND-OR form is a fairly general and standard way of expressing logic functions - sum-of-products. As is, for that matter, the OR-AND form - product-of-sums. Other forms may also allow expression of all logic functions, although they may be less familiar.
The AND-OR form can be used to accomplish arbitrary permutations, using matrices that have only a single bit per row and column. As can, for that matter, the AND-XOR form, although it may be necessary to invert after the permutation.
The AND-XOR form can be used for information theoretic calculations, as can POPCNT.
The AND-XOR form can be used as a fairly general form of XOR based hash function, as described by Vandierendonck and De Bosschere in "XOR-Based Hash Functions".
Hardware
The AND-OR form is obviously easy to do with wired logic, and not so bad even with limited fan-in.
The XOR form is more expensive, requiring an XOR tree. POPCNT still more so, typically requiring an 8:3 counter (of which the XOR form needs only the lowest bit).
Because 8 levels of XOR can be hard, may motivate wider flatter formulations.
...
When writing this, I found the recent publication (2008):
http://palms.princeton.edu/system/files/hilewitz_asap_08.pdf
It is nice to see this "lore" of (super)computer architecture leaking into academia. "
" Message Gateway
TBD
256 bit messages - GenX register size
OpenGateway
CloseGateway
ForwardMsg
BarrierMsg - counter based barriers implemented in the "Message Gateway" messaging unit, delaying response until all threads join
UpdateGatewayState
MMIOReadWrite - provides a bridge from message passing to MMIO
The OpenGateway/CloseGateway pair amounts to the "channel" I have described in Message Passing Instructions. Indeed, GenX uses exactly the term "channel". "
---
insight:
things like 'sin' that aren't needed on embedded systems but that might have hardware acceleration on some platforms could be in a std library, rather than being core instructions.