        Fri Feb 02 16:28:25 CET 2002
 

o2

HTMLized by radrob@fsck.it

O2 Intro


[Picture of an O2 with monitor]


This page discusses SGI's O2 workstation, covering:





  • aspects of O2's design and the impact its architecture has on system performance, computational functions, exploitable hardware features, etc.

  • the various main CPUs available and how one should evaluate CPU differences, their importance, comparisons with other systems, etc.

  • the effects of screen resolution and colour depth on system performance and how this correlates with the O2's architectural design.

  • the benefits of O2's design when dealing with 3D scenes that involve complex geometry/lighting calculations, comparisons with other systems, etc.

  • ICE, the dedicated ASIC which O2 has for accelerating image/video tasks.

If you are researching the O2 system, note that the main index has other relevant information pages: SGI Performance Comparisons, Melting the ICE (image and video processing on O2), a comparison between O2 and Indigo2, primitive graphics benchmark comparisons with all other SGI systems, and so on. Just about everything you could want to know, really. I recommend reading all relevant material before making any purchasing or upgrade decisions.


o2 Architecture


The O2 workstation, first released on October 10th 1996, uses a system design entitled Unified Memory Architecture, or UMA for short (not to be confused with NUMA which refers to the Origin server architecture). SGI has their own White Paper on the subject of UMA design, which I strongly recommend you read. This page is the result of my own extensive investigative work.

Most existing (and older) workstations and computers (pre-1998) are based on 'bus' technologies, where data is moved around the system via a shared bus from one subsystem to another. Subsystems include main memory, CPU, graphics, texture, image, video, I/O connections, networking ports, etc. Sometimes, there is a fast link between CPU and main RAM, but even then the link that connects these two elements to the rest of the system (eg. the bridge + PCI in many PCs) is slow when judged by the demands of today's applications - painfully slow for some tasks.

The problem with the shared-bus design is that, as data bandwidth demands become greater and tasks increase in complexity, many tasks become difficult to accomplish since they require vast amounts of data to be moved around the system, eg. the use of an incoming video stream as a texture in a 3D model. The normal response to such problems has been to increase the clock speed of the bus or to make it wider, but there comes a point when the bandwidth gained is small and not worth the extra cost.

UMA solves this problem by having just one 'unified' high speed memory block. The heart of the system is no longer the main CPU; instead, the memory/graphics controller becomes the focus of attention. Understanding how O2's UMA system works, and what the memory/graphics controller can do, is the key to comprehending the things O2 can do which other systems such as traditional PCs cannot. Note, however, that as time moves on, traditional designs solve these problems using other methods, eg. the Intel i760 graphics ASIC can take a video stream directly into itself to use as a texture on a 3D model; however, if the main CPU wanted to carry out operations on that video stream, it would need to be copied to main memory first - this is not the case with O2.

UMA doesn't solve every problem, but when it comes to satisfying the demands placed on workstations by users, it is an ideal low-cost solution. However, there comes a point when a task's complexity is so great, eg. rendering 750MByte seismic data sets, that UMA is not an appropriate solution; thus, systems like Octane use a crossbar-based approach which offers massive system and memory bandwidth (Octane can render a 750MByte volumetric data set in just two seconds). At the high-end, the crossbar concept is combined with advanced interconnection technologies to offer the massively scalable systems known as Origin (file/data/web/media serving, number crunching) and Onyx2 (all the power of Origin combined with the fastest graphics designs in the world).

All these concepts share a common approach: the focus is on moving data around in the most efficient way possible, removing the bandwidth bottleneck which has existed for many years in the computing world, and allowing each computer subsystem to operate at its maximum potential.

Here is SGI's simple O2 diagram (note that the small annotated numbers do not refer to bus speeds or bandwidths; I believe they denote ASIC pin counts):

[O2 Block Diagram]

The above figure is rather sparse on detail, but it's a good summary. For the complete picture, here is SGI's full-blown O2 diagram, which even shows some of the board-level layout features:

[Detailed O2 Block Diagram]



UMA



Unified Memory Architecture

So how does UMA work? An example: suppose a video stream is brought in from the O2 digital camera and the data is stored in an area in the main RAM block (termed a 'Digital Media Buffer', or DMbuffer). If one then wished to use the data as a texture for a 3D model, all that needs to be done is to pass a pointer for the data area to the CRM chip, thus removing the need to copy the data. As a result, video-as-texture (2.6MB MPEG) is trivial with O2.
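The difference between copying data and passing a pointer can be sketched with ordinary Python buffers. This is only an analogy: it does not use the IRIX DMbuffer API, and the frame size and variable names are hypothetical.

```python
# Illustrative analogy only: plain Python buffers, not the IRIX DMbuffer API.

FRAME_BYTES = 640 * 480 * 4          # hypothetical 32bit 640x480 frame

# One block of "main memory" that the capture side writes into.
frame_buffer = bytearray(FRAME_BYTES)

# Bus-style approach: the consumer receives its own copy of the pixels.
texture_copy = bytes(frame_buffer)

# UMA-style approach: the consumer is handed a reference to the same memory.
texture_view = memoryview(frame_buffer)

# A new video frame arrives and overwrites the first pixel.
frame_buffer[0:4] = b"\xff\x80\x40\x00"

print(bytes(texture_view[0:4]))      # the view sees the new pixel at once
print(texture_copy[0:4])             # the copy is stale until re-copied
```

The view costs nothing per frame, whereas the copy has to be repeated for every frame, which is the bandwidth the bus-based design spends.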

Another example is volume rendering. Since there is only a single memory block, one has access to effectively unlimited texture memory (ie. limited only by main RAM size). Thus, one can easily manipulate large texture sets, eg. 256MB of CAT scan data.

Of course, having virtually no limit to texture memory also benefits other areas such as visual simulation (many different textures are needed for landscapes, buildings, trees, etc.). The UMA design ensures fast and reliable access to the data, with 2.1GB/sec peak transfer rate between main RAM and the memory/graphics controller (CRM).

By way of a summary, here is an edited version of comments made by Tom Furlong (SGI's vice-president and general manager for desk-side systems) in an interview about the O2:

"We got rid of the bus because current bus based systems reached their [bandwidth] limit to CPU and standard I/O; when you add 3-D image processing or audio they start to fall apart, wasting precious bandwidth copying data around. O2 is based on a new unified memory architecture that puts 2.1 gigabytes of system bandwidth right where the computation is done, that's 20 times the bandwidth of today's [1996] fastest PC. 02 doesn't waste any of its bandwidth moving the data around the system. Instead it has multiple computational memory, the CPU coordinates the work of graphics I/O video compression to accomplish the computation without extraneous data movements; with 20 times bandwidth and no wasted data movements it can handle monstrously large data sets, movies, and hugely complicated special effects without missing a beat. To get the massive amounts of data O2 implements standard I/O that includes serial, parallel and embedded CD-ROM, two 40 Megabytes per second SCSI channels, auto sensing, 100 megabit ethernet connection, two channel audio I/O, two video channels, one video output and we have thrown in a 64-bit PCI for anything left out. It's really a technology tour de force."

CPU info



The Main CPU

Currently, the R5000 and R10000 CPUs can be used in O2. General information on these can be found on the main index (eg. comparisons with other systems, or different R10000 CPUs in one particular system); the following information is more concerned with the specific use of these processors in O2.

Although the CRM ASIC handles most graphics functions in hardware, all geometry and lighting calculations are handled by the main CPU. At first one might think this is a disadvantage, but it's cheaper and also means there is an easy upgrade path to increased performance: just get a faster/newer CPU (there are other unexpected benefits too, which are discussed later). To this end, the R5000 is a good solution: it has been specifically designed to handle operations that are typically found in 3D graphics tasks, eg. MADD instructions. Do not dismiss the R5000 just because it isn't an R10000.

Please examine the detailed test results for the use of these CPUs in O2 before forming any conclusions. Also read Byte's article on the R5000, and examine the R10000 performance comparison pages for R10K/195 and R10K/250, the SGI Performance Comparisons page, etc.

Here is a SPEC95 performance summary table (averages only) for R5000 and R10000 in O2:

                             SPECint95     SPECfp95

R5000PC 180MHz (no L2):         3.70         4.55
R5000SC 180MHz 512K L2:         4.82         5.42
R5000SC 200MHz 1MB L2:          5.40         5.70
R5200SC 300MHz 1MB L2:          8.04         6.86
R10000SC 195MHz 1MB L2:        10.10         8.77
R10000SC 250MHz 1MB L2:        12.10         9.71
R12000SC 300MHz 1MB L2:        14.49        10.42
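
As a reading aid, the averages in the table can be expressed relative to one configuration. The sketch below uses the R5000SC 200MHz row as the baseline; that choice, and the function name, are mine rather than anything SPEC or SGI prescribes.

```python
# SPEC95 averages copied from the table above: (SPECint95, SPECfp95).
spec95 = {
    "R5000PC 180MHz (no L2)":  (3.70, 4.55),
    "R5000SC 180MHz 512K L2":  (4.82, 5.42),
    "R5000SC 200MHz 1MB L2":   (5.40, 5.70),
    "R5200SC 300MHz 1MB L2":   (8.04, 6.86),
    "R10000SC 195MHz 1MB L2":  (10.10, 8.77),
    "R10000SC 250MHz 1MB L2":  (12.10, 9.71),
    "R12000SC 300MHz 1MB L2":  (14.49, 10.42),
}

BASELINE = "R5000SC 200MHz 1MB L2"   # arbitrary choice of reference row


def relative_to_baseline(name):
    """Return (int ratio, fp ratio) of a configuration versus the baseline."""
    base_int, base_fp = spec95[BASELINE]
    cpu_int, cpu_fp = spec95[name]
    return cpu_int / base_int, cpu_fp / base_fp


for name in spec95:
    int_ratio, fp_ratio = relative_to_baseline(name)
    print(f"{name:26s} int x{int_ratio:.2f}   fp x{fp_ratio:.2f}")
```

For the R10000 and R12000 rows the integer ratio is larger than the fp ratio, which is the pattern the discussion below tries to explain.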

Some people react with disappointment when first examining the R10000's/R12000's floating point (fp) performance in O2, compared to other SGI systems which use R10000 such as Octane and Origin (the discussion here will refer to R10000, but the same issues apply to R12000 in O2). There are several things to be said on this subject:

  • SPEC95 cannot properly measure the capabilities of O2. If you look at application performance (eg. AIM), the R10000 does quite well, especially for integer (int) tasks (O2 can be just as fast as Origin for int tasks). If you're interested in an R10000 configuration, you must decide whether the extra cost is worth the performance gain. Make sure you have your particular task tested before you buy. Operations such as off-screen rendering will benefit from an R10K, but don't think that O2 is a number crunching box because it isn't and was never designed to be. If fp number crunching is your main task, then you should be looking at Origin or Octane, not O2. Certainly, O2's int performance is greatly improved with an R10K, easily matching Octane, almost always beating older systems such as Power Challenge, Indigo2, etc. and often beating an R10K/180MHz Origin200 (in fact, for m88ksim on SPEC95 running on R10K/250, O2 outperformed an Origin2000!).

  • R10K was never designed for a memory system like UMA. R10K was designed for a faster memory system than is used in O2, such as is used in the Octane/Origin/Onyx2 line; O2's memory system runs at a lower clock speed and has higher memory latency properties than Origin.

  • CRM contains memory control circuitry for the R5K, but not for R10K. To compensate for this, R10K O2 systems have an extra ASIC on the R10K daughterboard to handle L2 cache requests. CRM was designed for 32byte cache refills, whereas R10K is designed for 64byte or 128byte refills. The extra ASIC converts R10K cache refill requests into multiple 32byte requests. This means extra time spent during cache misses. As a result, SPECfp95 results on O2 with R10K are much lower than R10K in other modern SGI systems, even though real world application performance can be 60% better than an R5K at a higher clock (SPEC95 punishes cache misses quite heavily). Also, R10K in O2 can only offer 1 outstanding cache miss, compared to 4 in Octane/Origin/Onyx2; cache-sensitive code will suffer because of this. Examine R10K/195 performance comparison analysis for complete details.

  • R10K does not help much for 3D graphics tasks that do not involve 64bit processing (eg. Gouraud shading). This is because most 3D graphics tasks require only single precision floating point (fp) computation (this especially applies to lighting and geometry). Thus, a 180MHz R5K will be faster than a 150MHz R10K for non-textured 3D graphics tasks. But at an equal clock speed, R10K will be about 25% faster than R5K for certain graphics tasks. Performance figures at a similar clock are higher for R10K partly because 3D graphics does involve a degree of int processing (eg. pointer chasing, array handling) and such int tasks are faster with R10K.

So how relevant is SPEC? A good example for me is that one of the int tests is JPEG compression. In O2, this operation is done in real-time by dedicated hardware (ICE), so the SPEC result is of little value. Also, as far as the R5K is concerned, many SPEC tests employ double precision computation - something the R5K is not optimised for. The R5K does not have any special int optimisations. For R10K in O2, the int performance is very good, so O2 could easily act as a low-cost web server - indeed, I know of several institutions which use O2 for just this function.

Also, as far as I know, none of the SPECfp95 tests use the kind of single precision fp calculations that R5K was specifically designed for, namely MADD-style computation (the matrix math found in 3D graphics). See the Byte article for more information.

One must decide carefully whether the improved performance offered by an R10K is worth the extra cost, though R10K O2 systems have become considerably cheaper in the last year. Either way, always have your application tested before making any purchasing decision. It must be said though, R10K/R12K systems are definitely good for integer tasks and 2D work.


There are several additional aspects of the main CPU in O2 that are worthy of discussion.


The Impact of Screen Resolution on CPU Performance

Lower screen resolutions and shallower colour visuals will allow O2 to run some kinds of application faster. For example, on an R5000SC/200MHz O2, changing the screen resolution from 1280x1024 32+32 down to VGA16 improves the STREAM memory benchmark by 13 percent!

This effect may sound bizarre, but the explanation is quite simple and correlates correctly with the O2's UMA design (refer back to the architecture diagrams given earlier for clarification).

The CRM ASIC handles data transfers between itself and the:

  • main CPU,
  • Image Compression Engine (ICE),
  • Display Engine (DE),
  • and I/O Engine (IOE).

For most users, the vast majority of the available bandwidth from CRM will be used by the DE. A typical 32bit 1280x1024 72Hz display requires a bandwidth of 360MB/sec. While this data transfer is going on, the main CPU and other system components must utilise the remaining bandwidth. Thus, if one decreases the display complexity, there will be less data moving from CRM to DE and hence more opportunities for CRM to service the rest of the system. As a result, memory-intensive applications will speed up, and tasks such as video I/O won't have to compete to the same degree for bandwidth resources. There is always enough bandwidth to handle video I/O at real-time rates; the difference is that the IOE will be more likely to be able to transfer data when first requested (quicker response, better reliability, fewer conflicts with other data being moved around, lower possibility of error, etc.)
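
The 360MB/sec figure can be reproduced with a short calculation. This is a minimal sketch, assuming that 1MB means 2^20 bytes and that only the visible frame is scanned out (no overlay or depth planes); the function name is hypothetical.

```python
def display_bandwidth_mb_per_sec(width, height, bits_per_pixel, refresh_hz):
    """Bandwidth needed to scan out one visible frame buffer, in MB/sec.

    Assumes 1MB = 2**20 bytes and ignores any overlay or depth planes.
    """
    bytes_per_frame = width * height * bits_per_pixel // 8
    return bytes_per_frame * refresh_hz / (1024 * 1024)


# The example quoted above: 32bit 1280x1024 at 72Hz.
print(display_bandwidth_mb_per_sec(1280, 1024, 32, 72))   # -> 360.0

# A much simpler display for comparison: 16bit 640x480 at 60Hz.
print(display_bandwidth_mb_per_sec(640, 480, 16, 60))
```

The second call shows how little of the CRM's bandwidth a low-complexity display needs compared to the first.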

In fact, for someone whose main task is video I/O processing where the actual on-screen display isn't important (ie. only the video I/O signals matter, and perhaps parallel, serial and SCSI transfers too), it would be advantageous to be able to shut down the CRM/DE data transfer completely, allowing other system components to make full use of the available bandwidth, especially the main CPU (eg. offscreen rendering would be quicker). I am currently investigating how this can be achieved. It is likely that using a VT terminal connected to the serial port, instead of using the main monitor, would be one way of achieving this, but at the moment I have no practical data to prove this. When I work out how to use an O2 via a VT terminal and can obtain a suitable VT, I'll run some tests.

Who cares? Why does it matter? Well, consider that a 13% performance increase is, on average, better than the performance increase obtained by upgrading from a 180MHz R5000SC CPU to a 200MHz R5000SC CPU. For some users, it could mean the difference between a task taking 45 minutes instead of 60 minutes. If one has many such tasks to run, that extra saving could be very useful to those with time constraints; for medical people, it could be a life-saver.

Observe how the STREAM bandwidth figures in the following diagram (MB/sec), for a 200MHz R5000SC O2 running IRIX 6.3, gradually improve (ie. increase) as the display complexity decreases:

Display Complexity      Copy      Scale     Add     Triad

1280-1024-32-32-75      69.2      69.1      69.6    70.7
1280-1024-32-32-60      72.1      71.5      72.8    73.8
1280-1024-32-32-50      74.5      73.5      72.7    73.7
1280-1024-32-32-48      75.4      74.5      74.1    75.1
1280-1024-16-16-75      74.1      73.1      73.4    74.9
1280-1024-16-0-75       76.6      75.5      74.3    76.2
1024-768-32-32-75       75.6      75.1      75.1    75.9
1024-768-32-32-60       76.8      76.2      75.8    77.0
1024-768-32-0-60        77.1      76.1      76.1    77.0
1024-768-16-0-60        80.8      78.9      79.3    80.3
800-600-32-32-72        79.5      78.1      76.5    77.9
800-600-32-32-60        80.5      78.6      78.7    80.0
640-480-32-32-60        82.5      80.2      80.8    82.0
640-480-16-16-60        82.7      81.3      82.1    83.0
640-480-16-0-60         83.2      81.2      82.2    83.2
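
To put numbers on the trend, the sketch below computes the percentage gain between two rows of the table. The text does not say which rows the 13 percent figure quoted earlier refers to, so the pairing used here (first and last rows) is only one possible choice.

```python
# STREAM results (MB/sec) copied from two rows of the table above.
KERNELS = ("Copy", "Scale", "Add", "Triad")
stream = {
    "1280-1024-32-32-75": (69.2, 69.1, 69.6, 70.7),
    "640-480-16-0-60":    (83.2, 81.2, 82.2, 83.2),
}


def percent_gain(before, after):
    """Percentage increase in bandwidth going from 'before' to 'after'."""
    return (after - before) / before * 100


high = stream["1280-1024-32-32-75"]   # most complex display in the table
low = stream["640-480-16-0-60"]       # least complex display in the table

gains = {k: percent_gain(h, l) for k, h, l in zip(KERNELS, high, low)}

for kernel, gain in gains.items():
    print(f"{kernel:6s} +{gain:.1f}%")
```

Other pairs of rows can be compared by adding them to the `stream` dictionary.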

I don't yet have any data for STREAM running on O2 when the display is a VT terminal. I would be most interested to hear from anyone who has run such a test, or from someone who has any idea how the CRM/DE data transfer could be shut down under software control. Whoever figures out how to do this in a reliable manner will help thousands of users.

Over the next few weeks, I will be running other tests to see if this effect can be seen for everyday tasks such as 3D modeling, movie conversion, etc. Some tasks may be disk I/O bound, in which case the display complexity will be irrelevant; other tasks may be compute bound, in which case lowering the display to VGA16 could give a good speed increase. Note: forcing the monitor to go into power saving mode does not shut down the CRM/DE data transfer.


CPU perf




Geometry/Lighting and Comparing to Hardware Accelerated Systems

How can an R4600PC 100MHz Indy XL outperform an R4400SC 250MHz Indigo2 Elan for a 3D graphics task? Answer: when the 3D scene includes complex geometry and lighting calculations.

O2 does all geometry and lighting calculations in the main CPU. The same is true for Indy XL, Indigo2 XL, or any similar system such as Crimson Entry. At first this may sound like a disadvantage, but as main CPUs have improved in power, we have now entered an era where older systems with good main CPUs and no hardware graphics acceleration can easily outperform older systems with old types of hardware accelerator board (XS24, XZ, Elan and Extreme). When this situation occurs, it doesn't really matter what type of CPU is present in the system that has the hardware acceleration. The key point is that the main CPU in the former system has an effective fp performance that is better than the Geometry Engines (GEs) on the latter system's accelerator board.

The original XZ graphics offered 64MFLOPS of GE power; later revisions (seen as Elan by hinv on Indigo2, and XZ on Indy) offered 128MFLOPS, and Extreme offered 256MFLOPS. The R5000 CPU in Indy offered between 300MFLOPS and 360MFLOPS peak single-precision MADD performance.

Complex lighting calculations can hit these older accelerator boards (XZ/Elan/Extreme) hard. All older SGIs with hardware acceleration only support one hardware light (compare to InfiniteReality which supports four), so when multiple lights are present, the calculations become too complex, context switches occur because temporary data must be stored somewhere, the graphics board FIFOs fill up because the main CPU is sending in data faster than the board can process it, the CPU has to pause constantly to wait for the FIFOs to drain, and thus the GEs become the main bottleneck. In such situations, the main CPU may be little used - I saw only 2% CPU usage when running such a scenario on my Indigo2 Elan.

On the other hand, systems like XL offload all such calculations onto the main CPU. When things get tough, the main CPU runs as fast as it can just as always, hence the situation with FIFOs filling up, context switches occurring, etc. never happens and the system is ironically able to give a fair performance. That is how an Indy XL can outperform an Indigo2 Elan. It is also the reason why an R5000 Indy XZ can be slower than an R5000 Indy XL (the former must do its geometry/lighting calculations on the XZ board, completely wasting the much higher fp power of the main CPU).

What relevance is this to O2? Well, as CPUs improve in power, it is becoming very clear that O2 is almost certainly going to significantly outperform even a MaxIMPACT in the long term for certain types of task - it'll definitely be able to outperform a HighIMPACT or SolidIMPACT anyway. The reason is geometry/lighting: as the main CPUs for O2 improve, the single-precision fp performance will eventually exceed that offered by the GEs of systems like SolidIMPACT (480MFLOPS), High IMPACT (480MFLOPS) and Max IMPACT (960MFLOPS). A 300MHz R12000 would offer 600MFLOPS, so I would expect an R12K/300 O2 to outperform a SolidIMPACT Indigo2 for tasks involving complex geometry and lighting (especially something like multiple spotlights). Casting one's imagination ahead, I would fully expect something like (say) a 500MHz R14K O2 to thoroughly outperform an Octane/SI (at least 1GFLOP for the O2 compared to half that for the Octane/SI's GEs).

These effects will be important unless the bottleneck becomes something else such as:

  • pixel fill,
  • texture bandwidth,
  • memory bandwidth/latency,
  • etc.

These could be important if, for example, the 3D scene contained a very large number of polygons, or a complex dynamic scene. My comments above mainly refer to scenes that involve multiple lights and low polygon counts, eg. VRML worlds.

For a more thorough investigation and discussion of these issues, please see my HolliDance Benchmark page, which includes a table of example performance results for a typical dynamic 3D real-time scene that contains complex lighting. If you own or have access to an SGI, please consider submitting a set of results as I am convinced that, for relevant tasks, the HolliDance Benchmark results table will be a very useful resource to 2nd-hand buyers and those considering upgrades from older systems. It should also be useful to users of faster systems who may be interested in possible performance degradation when the number of lights reaches a certain threshold (see the benchmark page for more details).

When thinking about O2, these issues may be important if your task involves real-time 3D animation, VRML, low-end visual-simulation, etc. It could be especially relevant if you have an older system, are considering an upgrade, and aren't sure whether to go for something like an Indigo2 Extreme/IMPACT, an O2, or an entry-level Octane. It's quite surprising to think that O2 could gradually be seen to outperform many existing SGIs for tasks that involve complex lighting. However, I doubt this will occur with Onyx2 since IR supports four hardware lights and the GEs offer 2.56GFLOPS of processing power - much greater than even a theoretical 800MHz R14K (unless such a future CPU was able to do 4 fp operations per clock instead of 2 fp operations per clock).
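
The CPU-versus-GE comparisons in this section all rest on the same arithmetic: peak single-precision MFLOPS = clock (MHz) x fp operations per clock, with 2 operations per clock for a MADD. The sketch below applies that to the GE figures quoted above. These are peak numbers only and say nothing about sustained performance; the function names are hypothetical.

```python
# Peak GE performance in MFLOPS, as quoted in the text of this section.
ge_mflops = {
    "XZ (original)":   64,
    "Elan / later XZ": 128,
    "Extreme":         256,
    "SolidIMPACT":     480,
    "High IMPACT":     480,
    "Max IMPACT":      960,
    "InfiniteReality": 2560,
}


def peak_mflops(clock_mhz, ops_per_clock=2):
    """Peak fp rate, assuming one MADD (multiply + add) issued per clock."""
    return clock_mhz * ops_per_clock


def boards_exceeded(clock_mhz, ops_per_clock=2):
    """GE boards whose quoted peak is below the CPU's peak fp rate."""
    cpu = peak_mflops(clock_mhz, ops_per_clock)
    return [name for name, ge in ge_mflops.items() if cpu > ge]


for mhz in (180, 300, 500, 800):
    print(mhz, "MHz:", peak_mflops(mhz), "MFLOPS, exceeds", boards_exceeded(mhz))

# The hypothetical 4-ops-per-clock case mentioned above.
print("800 MHz, 4 ops/clock:", boards_exceeded(800, ops_per_clock=4))
```

With 2 operations per clock even the 800MHz case stays below the InfiniteReality figure; only the 4-operations-per-clock variant passes it, as the paragraph above says.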

With hindsight, and certainly for particular types of O2 user (eg. anyone doing VRML), the fact that O2 does all geometry/lighting calculations in software could prove very advantageous in terms of much greater performance in the future. Note that this kind of task is very different from the typical 'primitive' level benchmarks shown on technical reports and PR web sites. Such simplistic performance figures (eg. flat tris/sec, or lit, shaded, textured triangles/sec) almost always involve either no lighting whatsoever, or just a single directional light, thus hardware acceleration boards never experience the problem of having to deal with more light sources than can be handled by the hardware at one time. A good example of this is that although an R4600PC 100MHz Indy XL outperforms an R4400SC 250MHz Indigo2 Elan for the HolliDance 3D animation program by 8 percent (large window, no texture), if one turns all the lights off then the Indigo2 immediately becomes 158% quicker than the Indy.

What I've tried to highlight here is that you should be very wary of assuming O2 must be better or worse than older systems simply because it's newer, has a better main CPU, etc. The reality may be much more complex because of the way graphics hardware works and how the different components of a system interact, coupled with the fact that different systems often work in very different ways.

For example, one might assume that an O2 should outperform an Indigo2 Extreme for Gouraud shaded tasks, and indeed it does on the primitive level benchmarks by a moderate to reasonable margin (between 7 and 65 percent for various CPUs); but what might be a surprise to many is that O2 can completely stomp over an Indigo2 Extreme for a 3D task that involves multiple lights. For the HolliDance benchmark, compared to R4400SC/250MHz Indigo2 Elan, the O2 was 510 percent faster! The primitive level benchmarks would have suggested a difference of around 170%.

But turn off the lights and the difference changes drastically: O2 is now 144% faster than Indigo2 Elan, a figure which correlates much better with the primitives tests. In other words, when the complex lighting is turned off, both systems speed up, but Indigo2 Elan speeds up by a much greater degree (300% compared to 60%) because all the horrible bottlenecks concerning the GEs are removed, though it's still slower overall. Obviously, I would expect the differences between O2 and Indigo2 Extreme for HolliDance to be less, but I reckon O2 would still be at least 200% quicker when the lights are turned on (as opposed to the 20% difference one might expect from the primitives tests).
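
The percentages above are easier to check as ratios. The sketch below assumes "X percent faster" means a speed ratio of 1 + X/100 (my reading of the figures, not something the benchmark page defines) and shows that the lit and unlit results are consistent with each other.

```python
def ratio(percent_faster):
    """Convert 'X percent faster' into a speed ratio (assumed convention)."""
    return 1 + percent_faster / 100


# Figures quoted above for HolliDance, O2 versus Indigo2 Elan.
o2_vs_elan_lit = ratio(510)     # lights on: O2 is 510 percent faster
o2_speedup = ratio(60)          # O2 speeds up by 60% when lights go off
elan_speedup = ratio(300)       # Indigo2 Elan speeds up by 300%

# With the lights off, both machines get faster, the Elan much more so.
o2_vs_elan_unlit = o2_vs_elan_lit * o2_speedup / elan_speedup
percent_unlit = (o2_vs_elan_unlit - 1) * 100

print(f"O2 vs Elan, lights off: x{o2_vs_elan_unlit:.2f} "
      f"({percent_unlit:.0f} percent faster)")
```

This works out to about 144 percent, matching the lights-off figure given above.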

3D graphics is a strange thing. Yet again, this is more proof, if any were needed, that the only benchmark test one should really trust when making a purchasing decision is one's own application.



ICE



ICE consists of two parts:

A 66MHz 64bit R4K-derived control logic unit, plus a 66MHz SIMD 128bit MDMX-style central processing unit. The SIMD core can do sixteen 8bit MACs or eight 16bit MACs per clock. Each MAC is 2 operations (multiply + add), so 66M * 2 * 16 = roughly 2 billion operations/sec. So, the MAC figure is roughly 1 billion MACs/sec for 8bit integer ops and roughly 500 million MACs/sec for 16bit integer ops (ICE cannot be used for fp computation). The unit as a whole is designed to handle multiple data streams.
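
The throughput arithmetic can be written out as follows. This is a sketch that assumes a clock of exactly 66MHz and one full 128bit vector MAC issued every clock; the constant and function names are hypothetical.

```python
ICE_CLOCK_HZ = 66_000_000     # assumed to be exactly 66MHz
VECTOR_WIDTH_BITS = 128       # width of the SIMD unit


def macs_per_clock(element_bits):
    """Number of integer elements that fit in one 128bit vector."""
    return VECTOR_WIDTH_BITS // element_bits


def macs_per_sec(element_bits):
    return ICE_CLOCK_HZ * macs_per_clock(element_bits)


def ops_per_sec(element_bits):
    """Each MAC counts as two operations: one multiply plus one add."""
    return 2 * macs_per_sec(element_bits)


for bits in (8, 16):
    print(f"{bits:2d}bit: {macs_per_clock(bits):2d} MACs/clock, "
          f"{macs_per_sec(bits) / 1e6:.0f}M MACs/sec, "
          f"{ops_per_sec(bits) / 1e9:.3f}G ops/sec")
```

The exact products are slightly above the rounded 1 billion, 500 million and 2 billion figures quoted in the text.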

The controller element is programmable, to allow for future video and image formats - this means it's likely that the unit is perfectly capable of doing four 32bit ops or two 64bit ops per clock, but I don't think the current libraries support such operations since today's video/image tasks don't need them.

ICE allows one to do some impressive real-time image and video operations, some of which are shown in the various O2 demo programs. Real-time examples include: edge detection, colour space conversion, luma and chroma keying, etc. For a more thorough description of ICE, please see my main ICE page.

Incidentally, because of the many questions about ICE that I've thrown at people in SGI, a member of SGI's Global Technical Support has begun the process of writing a proper report on ICE for a future issue of Pipeline.

Finally, here is SGI's own description of the ICE system:

The following table lists some key digital media hardware specs of the O2:

O2

   Image and Compression Engine (ICE):

   * Built-in motion JPEG video compression/decompression
   * Built-in imaging acceleration

   Video input and output

   Screen-capture video source (graphics screen available as video input device)

   Improved digital video camera with built-in microphone and shutter button

Silicon Graphics is also releasing IRIX™ 6.3 for O2. This updated OS version has the following new elements:

   * New digital media buffer (DMbuffer) programming 
     interface for sharing unified memory among the 
     application, video I/O devices, compression, 
     graphics rendering, and graphics display
   * New Video Library (VL) programming interface 
     to DMbuffers
   * New digital media image conversion (dmIC) 
     programming interface based on DMbuffer for 
     direct data transfer among image-conversion 
     algorithms/devices, video I/O, and graphics
   * Hardware-accelerated OpenGL imaging extensions

Audio and Video I/O Ports

The following I/O devices transfer audio samples and video pixels into and out of main system memory:

   * Camera and camera microphone
   * Two line-level analog stereo outputs and 
     one line-level analog stereo input
   * S-video and composite video in/out
   * Headphones out
   * Microphone in (mono)
   * Speaker output
   * Optional CCIR 601 digital video adapter in/out

Digital Media Buffer Architecture

The DMbuffer is a new API for programmatic access to a new IRIX operating system feature that unifies the memory buffering systems of live video devices, such as video input and output and image compression and decompression. Also, OpenGL can both read from and render to the DMbuffer system, thus enabling completely programmable video effects: anything that you can render to a window you can also render offscreen and send directly to video output or compression. Furthermore, video input and decompression output are available for graphics display.


Image Engine



The software architecture consists of the following elements:

   * DMbuffer
   * Ability to treat DMbuffer data as pbuffer or 
     texture map data in OpenGL
   * VL receive/send DMbuffers to/from video 
     I/O hardware
   * ICE (Image and Compression Engine) uses 
     DMbuffers for input and output
   * New Digital Media Library (libdmedia) 
     image conversion API (dmIC)

Image Processing Engine

ICE is a chip, and digital media image conversion (dmIC) is a software interface. Together, these two components enable video compression/decompression functions; they also allow applications to display multiple image streams.

The ICE chip contains the following components:


   * MIPS RISC core for program control
   * Integer vector unit capable of 8 
     multiply-accumulates per clock
   * Bit stream encoder and decoder
   * Intelligent DMA controller

These features are tied together with highly optimized code for applications such as JPEG encode and decode, general and separable convolutions, color matrix multiplies, and histogram generation.

Providing the functionality of the Cosmo Compress™ option card for Indy, ICE is even more flexible than its predecessor. In addition to handling single streams of live video, ICE is easily shared between multiple smaller streams (of any size and rate); for example: 4 quarter-size, full-rate streams are supported as easily as 1 full-size, full-rate (or 2 half-size, or 3 third-size, or 2 full-size, half-rate, and so on). Since there is no built-in video clock or video dimensionality on the ICE chip, you can also use it for non-standard sizes and rates; for example, film aspect ratio at film rate for film animation preview to the graphics monitor.
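
One way to read the stream examples above is as a pixel-rate budget: each stream costs (fraction of full frame size) x (fraction of full frame rate), and the total should not exceed one full-size, full-rate stream. That model is an assumption made here for illustration, not something SGI's description states, and the function name is hypothetical.

```python
from fractions import Fraction as F


def relative_load(streams):
    """Total load relative to one full-size, full-rate stream.

    streams: list of (count, size_fraction, rate_fraction) tuples.
    Assumes the load scales with pixels processed per second.
    """
    return sum(count * size * rate for count, size, rate in streams)


# The combinations listed in SGI's description above.
examples = {
    "1 full-size, full-rate":    [(1, F(1), F(1))],
    "4 quarter-size, full-rate": [(4, F(1, 4), F(1))],
    "2 half-size, full-rate":    [(2, F(1, 2), F(1))],
    "3 third-size, full-rate":   [(3, F(1, 3), F(1))],
    "2 full-size, half-rate":    [(2, F(1), F(1, 2))],
}

for label, streams in examples.items():
    print(f"{label:28s} load = {relative_load(streams)}")
```

Under this model every listed combination adds up to the same load, which is why they are all "supported as easily" as a single full-size stream.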

With the Indy, all imaging and compression calculations were done by the main CPU. ICE, which functions as a separate CPU, now handles these calculations, which frees the main CPU to handle other processes. Also with the Indy, you had to purchase dedicated cards, such as a JPEG card, to handle jobs such as compression. Silicon Graphics designed O2 with flexibility as a key objective. Consequently, the system can handle JPEG compression as well as image-processing functions without requiring a dedicated card for each process.

The IO Engine (IOE) is a chip that brings video and audio into and out of the system. Both IOE and ICE feature direct memory access (DMA) controllers, which enable them to read compressed images and output the information to a video out channel.

Not only do IOE, ICE, and UMA simplify the sharing of digital media data between subsystems, but their interaction is also many times faster than the more common method of transferring data between subsystems over a system bus.

New Image Conversion API

The Digital Media Library (libdmedia) that's included with IRIX 6.3 features a new digital media image conversion library (dmIC). You use this low-level API for memory-to-memory image compression/decompression and conversion. dmIC supports the standard software image codecs supported by the older Compression Library (libcl) interface in IRIX 6.2 and earlier releases. dmIC also supports the real-time motion JPEG encode/decode capability of the O2 ICE processor:


   * The dmIC interface makes software image 
     codecs and hardware-accelerated memory-to-memory 
     codecs look the same to application developers.

   * dmIC operates on image data stored in DMbuffers.
     This makes it possible to share image data 
     between hardware or software codecs and OpenGL
     or the Video Library, without
     copying data.

   * dmIC does not support in-line compression devices
     that are integrated into video capture or playback 
     hardware paths; for example Cosmo Compress, Impact
     Compress. These kinds of devices require a slightly 
     different programming model from the model used to 
     send data to and receive data from an asynchronous 
     memory-to-memory processor. The older libcl 
     continues to provide the applications
     programming interface to these kinds of devices.

   * An application can query dmIC to determine 
     whether the current system offers a real-time
     implementation of a particular memory-to-memory 
     codec; for example, JPEG.
     The real-time JPEG codec on O2 supports full-rate
     encode/decode at NTSC/PAL square pixel, 
     CCIR 601/525, and CCIR 601/625 video timings. 
     On systems that are not equipped with a real-time
     memory-to-memory codec, an application can
     also use the non-real-time software implementation.

   * The Compression Library functionality offered in 
     IRIX 6.2 will continue to be supported in IRIX 6.3 
     and future releases in order to ensure backward 
     compatibility for applications.

   * Starting with IRIX 6.3, MPEG audio/video encode 
     and Cinepak encode capabilities are bundled 
     with every Silicon Graphics system. 
     These software encoders no longer require a Silicon
     Graphics run-time license.
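
The query-then-fall-back pattern described in the list above can be sketched in Python as follows. The Codec record and choose_codec function are hypothetical stand-ins invented for this example; they do not correspond to actual dmIC calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Codec:
    name: str        # e.g. "jpeg"
    realtime: bool   # True for a hardware-accelerated implementation


def choose_codec(available, name):
    """Prefer a real-time implementation, else fall back to software."""
    matches = [c for c in available if c.name == name]
    if not matches:
        raise LookupError(f"no {name} codec on this system")
    # Real-time entries sort first because their key is False.
    return sorted(matches, key=lambda c: not c.realtime)[0]


o2_like = [Codec("jpeg", realtime=False), Codec("jpeg", realtime=True)]
software_only = [Codec("jpeg", realtime=False)]
```

With this shape, the calling code is identical on both kinds of system; only the speed of the codec that comes back differs.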

The new dmIC routines are declared in the public header . The new DMbuffer routines for creating and manipulating DMbuffers are declared in .

OpenGL Extensions for Image Data

Silicon Graphics created OpenGL extensions for O2, which allow you to use DMbuffers as either pbuffers or texture maps. The company also designed an OpenGL extension for rendering YCrCb (4:2:2) interlaced data, which lets you save video display pixels in a pixel format, rather than converting them to bits. Using these extensions, you can also perform hardware color space conversions from YCrCb to RGB.
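
In 4:2:2 data, each pair of horizontally adjacent pixels has its own luma values but shares one pair of chroma values. The Python sketch below expands such data into per-pixel triples. The byte order used here (chroma, luma, chroma, luma) is an assumption for illustration; the layout that O2 actually uses is not stated in this page.

```python
def unpack_422(packed):
    """Expand packed 4:2:2 bytes into one (Y, Cb, Cr) tuple per pixel.

    Assumes the byte order Cb, Y0, Cr, Y1 for each pixel pair.
    """
    if len(packed) % 4:
        raise ValueError("4:2:2 data must be a whole number of pixel pairs")
    pixels = []
    for i in range(0, len(packed), 4):
        cb, y0, cr, y1 = packed[i:i + 4]
        # Two horizontally adjacent pixels share one chroma pair.
        pixels.append((y0, cb, cr))
        pixels.append((y1, cb, cr))
    return pixels


pair = bytes([100, 50, 200, 60])   # Cb=100, Y0=50, Cr=200, Y1=60
```

Four bytes describe two pixels, so this format averages 16 bits per pixel instead of the 24 needed for separate 8-bit components.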

In addition to the new OpenGL extensions, O2 provides hardware acceleration for the following existing extensions:


   * Color scale and bias
   * Color table look-ups
   * Convolutions: 3x3, 5x5, and 7x7
    (separable and general)
   * Color matrix multiply
   * Histogram and MinMax

The support of these operations should promote interesting applications, with real-time feedback (attributable to the performance increase), in the fields of medical imaging, GIS, and post production. Moreover, the support of a common API (OpenGL) enables applications to run across the product line, with performance gains associated with the platform on which the applications are running.

DMcolor and OpenGL Color Matrix Extensions

With O2, you can use OpenGL hardware to perform color transforms. The system provides a DMcolor API, which can set up transform matrices that the application then passes to OpenGL. libdmedia also includes a software image color space conversion engine.
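
A color matrix transform of the kind discussed here is a 3x3 multiply applied to each pixel. The Python sketch below converts one pixel from luma/chroma form to RGB, assuming full-range 8-bit values and the widely used BT.601 (JFIF-style) coefficients. Both the coefficients and the function name are assumptions for illustration; the matrices that DMcolor produces may differ.

```python
# Full-range BT.601 (JFIF-style) coefficients; assumed for illustration.
YCBCR_TO_RGB = [
    [1.0, 0.0, 1.402],
    [1.0, -0.344136, -0.714136],
    [1.0, 1.772, 0.0],
]


def ycbcr_to_rgb(y, cb, cr):
    # Chroma is stored with an offset of 128, so re-center it first.
    centered = (y, cb - 128, cr - 128)
    rgb = (sum(m * v for m, v in zip(row, centered)) for row in YCBCR_TO_RGB)
    # Round and clamp each component to the 8-bit range.
    return tuple(min(255, max(0, round(x))) for x in rgb)
```

Neutral chroma (128, 128) leaves the luma value unchanged in all three components, which is a quick sanity check for any such matrix.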

Video Library and DMbuffers

The system has new Video Library (VL) calls for receiving video data (fields or pairs of fields interleaved to form frames) into DMbuffers, and for sending video data using DMbuffers. In addition, the video I/O path can handle mipmap generation for live video. The older VLbuffer interface is still supported as well.
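
Interleaving a pair of fields into a frame means alternating their scanlines. A minimal Python sketch follows, assuming that the first field carries the even-numbered lines of the frame. Which field comes first depends on the video format, and the function name is made up for this example.

```python
def interleave_fields(first, second):
    """Weave two fields (lists of scanlines) into one frame.

    Assumes `first` holds the frame's even-numbered lines (0, 2, 4, ...).
    """
    if len(first) != len(second):
        raise ValueError("fields must have the same number of lines")
    frame = [None] * (2 * len(first))
    frame[0::2] = first    # lines 0, 2, 4, ...
    frame[1::2] = second   # lines 1, 3, 5, ...
    return frame
```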

Audio Library Enhancements

Starting with IRIX 6.3, the Audio Library (AL) is packaged as a DSO rather than as a static library. The Audio Library adds a number of new functions and features; however, the 6.3 version of the library is backward-compatible with previous releases.

New features in 6.3 include:


   * The ability to support multiple audio I/O 
     devices in a single system.

   * Support for the O2 workstation's ability to 
     lock audio and video sample rates together in 
     hardware to prevent drift during synchronized
     audio/video recording or playback.

In addition, IRIX 6.3 introduces a new, generalized version of the Audio Control Panel, which can automatically configure itself when you add audio I/O devices to the system.

High-Resolution Timer for Synchronizing Audio and Video Streams

The O2 workstation includes audio/video hardware support for Silicon Graphics' high-resolution digital media timer, the unadjusted system time (UST) clock.

UST provides a common time base for timestamping audio samples and video fields as they enter or leave the system through the audio/video I/O ports. AL and VL each support timestamps based on the UST clock. Applications can use this common time base to correlate and synchronize audio and video input and output streams. Refer to the man pages alGetFrameTime(3dm) and vlGetUSTMSCPair(3dm) for more information.
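
The arithmetic behind this kind of synchronization can be sketched in Python. Given one reference pair of (UST, counter) for each stream and the stream's rate, any sample or field number can be mapped to a UST and back. The sketch assumes that UST values are in nanoseconds and uses hypothetical function names and made-up reference values; it does not call the AL or VL routines named above.

```python
NANOS_PER_SECOND = 1_000_000_000


def ust_of_count(count, ref_ust, ref_count, rate_hz):
    """UST (assumed nanoseconds) at which item `count` crosses the port."""
    return ref_ust + (count - ref_count) * NANOS_PER_SECOND // rate_hz


def count_at_ust(ust, ref_ust, ref_count, rate_hz):
    """Inverse mapping: the item number nearest to a given UST."""
    elapsed = ust - ref_ust
    half = NANOS_PER_SECOND // 2   # added so the division rounds to nearest
    return ref_count + (elapsed * rate_hz + half) // NANOS_PER_SECOND


audio_ref = (5_000_000_000, 0)   # (UST, audio frame), example values
video_ref = (5_020_000_000, 0)   # (UST, video field), example values

# When does video field 50 (at 50 fields per second) reach the port,
# and which 48 kHz audio frame lines up with it?
field_ust = ust_of_count(50, video_ref[0], video_ref[1], 50)
matching_audio_frame = count_at_ust(field_ust, audio_ref[0], audio_ref[1], 48_000)
```

Because both streams are stamped against the same clock, the 20 ms offset between the two reference points is accounted for automatically.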

The O2 architecture makes the high-resolution UST clock visible to PCI option cards as well as to the audio/video subsystems that are standard on the system.

Movie Library Enhancements

Starting with IRIX 6.3, the Movie Library is packaged as a pair of DSOs rather than as a single static library. The Movie Library API is backward-compatible with previous IRIX releases:

   * Movie file library (libmoviefile.so) deals with
     movie file reading, writing and editing. This 
     DSO includes the functions defined in the public 
     header .

   * Movie playback library (libmovieplay.so) provides 
     high-level functions for movie playback with 
     synchronized sound and images. This DSO includes 
     the functions defined in the public header 
     .

The IRIX 6.3 version of the Movie Library offers the following new features:

   * Support for Indeo encoding and writing AVI files

   * Support for creating MPEG-1 video and 
     systems bitstreams through the movie file 
     library interface

   * Support for full-rate, full-resolution motion 
     JPEG playback with synchronized audio by using
     the real-time JPEG decode capabilities of the
     O2 ICE processor

   * Ability to take advantage of the OpenGL 
     extensions for rendering interlaced image 
     data and YCrCb image data on O2

New Audio Conversion API

The Digital Media Library (libdmedia) that's included with IRIX 6.3 features a new digital media audio conversion library (dmAC). You use this low-level API for memory-to-memory audio sample format conversion, sample rate conversion, and compression/decompression.

dmAC supports these audio conversion operations:


   * Sample data format conversion 
   (signed, unsigned, float, double, scaling)

   * Sample rate conversion 
   (several algorithms)
   
   * Channel conversion 
   (mono, stereo, 4-channel, and so on)
   
   * Compression/decompression
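
To make the first and third operations in the list above concrete, here is a Python sketch of sample format conversion and channel conversion. It uses plain lists and made-up function names, and says nothing about how dmAC itself is called.

```python
def int16_to_float(samples):
    # Scale signed 16-bit samples to floats in [-1.0, 1.0).
    return [s / 32768.0 for s in samples]


def uint8_to_int16(samples):
    # Re-center unsigned 8-bit samples (midpoint 128) and widen to 16 bits.
    return [(s - 128) << 8 for s in samples]


def stereo_to_mono(interleaved):
    # Average each left/right pair of an interleaved stereo stream.
    return [(interleaved[i] + interleaved[i + 1]) / 2
            for i in range(0, len(interleaved) - 1, 2)]
```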

IRIX 6.3 supports the following audio compression algorithms:

   * CCITT G.711 mu-law and A-law
   * CCITT G.722
   * CCITT G.726 16, 24, 32, and 40 kbit/s
   * CCITT G.728
   * GSM
   * Intel DVI ADPCM
   * MPEG audio
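
As an example of what the simplest of these algorithms does, the Python sketch below implements G.711 mu-law companding for 16-bit input, following the bias-and-segment procedure commonly found in public reference code. It is an independent illustration and not the libdmedia implementation.

```python
BIAS = 0x84     # 132, added before the segment search
CLIP = 32635    # largest magnitude that still fits after adding BIAS


def linear_to_ulaw(sample):
    """Compress one signed 16-bit sample to an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment: position of the highest set bit, from bit 14 down.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are stored with all bits inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def ulaw_to_linear(byte):
    """Expand one mu-law byte back to a signed 16-bit value."""
    byte = ~byte & 0xFF
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if byte & 0x80 else magnitude
```

Each 16-bit sample becomes one byte, so telephone-rate audio at 8 kHz occupies 64 kbit/s.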

All of the audio compression/decompression and conversion algorithms are implemented in software. No special option hardware is required to perform these conversions.

The new dmAC routines are declared in the public header .

Starting with IRIX 6.3, MPEG audio encoding is bundled with all systems and no longer requires a license from Silicon Graphics.

Audio File Library Enhancements

The new version of the Audio File Library (libaudiofile) included in IRIX 6.3 offers support for several additional sound file formats:

   * Amiga IFF/8SVX
   * SampleVision
   * Audio Visual Research
   * Creative Labs VOC
   * Creative Labs SoundFont2

The library now offers transparent sample rate conversion in addition to transparent sample format conversion and compression/decompression. You can specify a virtual sample rate from within your application; for example, 48 kHz. The application can open sound files that contain data sampled at a variety of rates, and the library automatically converts between the sample rates used in the sound files (such as 44.1 kHz, 32 kHz, or 16 kHz) and the virtual sample rate that the application requests.
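
The idea of converting a file's sample rate to a virtual rate can be illustrated with the simplest possible method, linear interpolation between neighbouring samples. The Python sketch below is a stand-in chosen for brevity. The algorithm that the Audio File Library actually uses is not described here, and a production converter would also low-pass filter before reducing the rate.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation between neighbouring samples."""
    if len(samples) < 2 or src_rate == dst_rate:
        return list(samples)
    # Number of output samples that fit within the span of the input.
    n_out = (len(samples) - 1) * dst_rate // src_rate + 1
    out = []
    for n in range(n_out):
        pos = n * src_rate / dst_rate          # position in source samples
        i = min(int(pos), len(samples) - 2)    # left neighbour index
        frac = pos - i
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
    return out


# A file sampled at 24 kHz read at a virtual rate of 48 kHz.
doubled = resample_linear([0.0, 2.0, 4.0], 24_000, 48_000)
```

The same function handles rates such as 44.1 kHz or 16 kHz; only the ratio of the two rates matters.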

all contents by Ian Mapleson BSc

design 1997 - 2002 Fsck.it
Last updated 20020503
SGI, O2, UMA, NUMA, PUMA, and probably some other terms are trademarks of SGI (Silicon Graphics, Inc.).
