HTMLized by radrob@fsck.it
O2 Intro
![[Picture of an O2 with monitor]](image/overview_new2.jpg) This page
discusses SGI's O2 workstation, covering:
- aspects of O2's design and the impact its architecture has on system
performance, computational functions, exploitable hardware features, etc.
- the various main CPUs available and how one should evaluate CPU
differences, their importance, comparisons with other systems, etc.
- the effects of screen resolution and colour depth on system performance
and how this correlates with the O2's architectural design.
- the benefits of O2's design when dealing with 3D scenes that involve
complex geometry/lighting calculations, comparisons with other systems, etc.
- ICE, the dedicated ASIC which O2 has for accelerating image/video tasks.
If you are researching the O2 system, note that the main index has other
relevant information pages: SGI Performance
Comparisons, Melting the ICE (image and video processing on O2), a
comparison between O2 and Indigo2, primitive graphics
benchmark comparisons with all other SGI systems, and so on. Just about
everything you could want to know really. I recommend reading all relevant
material before making any purchasing or upgrade decisions.
O2 Architecture
The O2 workstation, first released on October 10th 1996, uses a system
design entitled Unified Memory Architecture, or UMA for short (not to be
confused with NUMA which refers to the Origin server architecture). SGI has
their own White Paper on
the subject of UMA design, which I strongly recommend you read. This page is the
result of my own extensive investigative work.
Most existing (and older) workstations and computers (pre-1998) are based on
'bus' technologies, where data is moved around the system via a shared bus from
one subsystem to another. Subsystems include main memory, CPU, graphics,
texture, image, video, I/O connections, networking ports, etc. Sometimes, there
is a fast link between CPU and main RAM, but even then the link that connects
these two elements to the rest of the system (eg. the bridge + PCI in many PCs)
is slow when judged by the demands of today's applications - painfully slow for
some tasks.
The problem with the shared-bus design is that, as data bandwidth demands
become greater and tasks increase in complexity, many tasks become
difficult to accomplish since they require vast amounts of data to be moved
around the system, eg. the use of an incoming video stream as a texture in a 3D
model. The normal response to such problems has been to increase the clock speed
of the bus or to make it wider, but there comes a point when the bandwidth
gained is small and not worth the extra cost.
UMA solves this problem by having just one 'unified' high speed memory block.
The heart of the system is no longer the main CPU; instead, the memory/graphics
controller becomes the focus of attention. Understanding how O2's UMA system
works, and what the memory/graphics controller can do, is the key to
comprehending the things O2 can do which other systems such as traditional PCs
cannot. Note, however, that as time moves on, traditional designs solve these
problems using other methods, eg. the Intel i760 graphics ASIC can take a video
stream directly into itself to use as a texture on a 3D model; however, if the
main CPU wanted to carry out operations on that video stream, it would need to
be copied to main memory first - this is not the case with O2.
UMA doesn't solve every problem, but when it comes to satisfying the demands
placed on workstations by users, it is an ideal low-cost solution. However,
there comes a point when a task's complexity is so great, eg. rendering 750MByte
seismic data sets, that UMA is not an appropriate solution; thus, systems like
Octane use a crossbar-based approach which offers massive system and memory bandwidth (Octane
can render a 750MByte volumetric data set in just two seconds). At the high-end,
the crossbar concept is combined with advanced interconnection technologies to
offer the massively scalable systems known as Origin (file/data/web/media
serving, number crunching) and Onyx2 (all the power of Origin combined with the
fastest graphics designs in the world).
All these concepts share a common approach: the focus is on moving data
around in the most efficient way possible, removing the bandwidth bottleneck
which has existed for many years in the computing world, and allowing each
computer subsystem to operate at its maximum potential.
Here is SGI's simple O2 diagram (note that the small annotated numbers do not
refer to bus speeds or bandwidths. I believe they denote ASIC pin counts):
The above figure is rather sparse on detail, but it's a good
summary. However, for the complete picture with full details, here is SGI's
full-blown O2 diagram, which even shows some of the board-level layout features:
UMA
Unified Memory Architecture
So how does UMA work? An example: suppose a video stream is brought in from
the O2 digital camera and the data is stored in an area in the main RAM block
(termed a 'Digital Media Buffer', or DMbuffer). If one then wished to use the
data as a texture for a 3D model, all that needs to be done is to pass a pointer
for the data area to the CRM chip, thus saving the need to copy the data. As a
result, video-as-texture (2.6MB MPEG) is trivial with O2.
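The difference between copying a frame and passing a pointer to it can be illustrated with a small Python sketch. This is only an analogy: a bytearray stands in for a buffer in unified memory, the function names are hypothetical, and none of this is the real DMbuffer or OpenGL interface.

```python
# Illustrative only: a bytearray stands in for a buffer in unified memory.
frame = bytearray(640 * 480 * 4)  # one hypothetical 640x480 32bit video frame


def as_texture_by_copy(buf):
    """Bus-style approach: the consumer receives its own copy of the pixels."""
    return bytes(buf)


def as_texture_by_reference(buf):
    """UMA-style approach: the consumer receives a view onto the same memory."""
    return memoryview(buf)


copied = as_texture_by_copy(frame)
shared = as_texture_by_reference(frame)

frame[0] = 255  # the "camera" writes a new pixel value into the buffer

print("copy sees:  ", copied[0])  # stale value, the copy must be redone
print("shared sees:", shared[0])  # new value, no data was moved
```

The copy has to be repeated for every new frame, whereas the shared view costs nothing extra per frame; that is the saving the pointer-passing approach is after.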
Another example is volume rendering. Since there is only a single memory
block, one therefore has access to effectively unlimited texture memory (ie.
limited only by main RAM size). Thus, one can easily manipulate large texture
sets, eg. 256MB of CAT scan data.
Of course, having virtually no limit to texture memory also benefits other
areas such as visual simulation (many different textures are needed for
landscapes, buildings, trees, etc.). The UMA design ensures fast and reliable
access to the data, with 2.1GB/sec peak transfer rate between main RAM and the
memory/graphics controller (CRM).
By way of a summary, here is an edited version of comments made by Tom
Furlong (SGI's vice-president and general manager for desk-side systems) in an
interview about the O2:
"We got rid of the bus because current bus based systems reached their
[bandwidth] limit to CPU and standard I/O; when you add 3-D image processing
or audio they start to fall apart, wasting precious bandwidth copying data
around. O2 is based on a new unified memory architecture that puts 2.1
gigabytes of system bandwidth right where the computation is done, that's 20
times the bandwidth of today's [1996] fastest PC. O2 doesn't waste any of its
bandwidth moving the data around the system. Instead it has multiple
computational memory, the CPU coordinates the work of graphics I/O video
compression to accomplish the computation without extraneous data movements;
with 20 times bandwidth and no wasted data movements it can handle monstrously
large data sets, movies, and hugely complicated special effects without
missing a beat. To get the massive amounts of data O2 implements standard I/O
that includes serial, parallel and embedded CD-ROM, two 40 Megabytes per
second SCSI channels, auto sensing, 100 megabit ethernet connection, two
channel audio I/O, two video channels, one video output and we have thrown in
a 64-bit PCI for anything left out. It's really a technology tour de force."
CPU info
The Main CPU
Currently, the R5000 and
R10000 CPUs can be used in O2. General information on these can be found on the
main index (eg. comparisons with other systems, or different R10000 CPUs in one
particular system); the following information is more concerned with the specific
use of these processors in O2.
Although the CRM ASIC handles most graphics functions in hardware, all
geometry and lighting calculations are handled by the main CPU. At first one
might think this is a disadvantage, but it's cheaper and also means there is an
easy upgrade path to increased performance: just get a faster/newer CPU (there
are other unexpected benefits too which are discussed later). To this end, the
R5000 is a good solution: it has been specifically designed to handle operations
that are typically found in 3D graphics tasks, eg. MADD instructions. Do not
dismiss the R5000 just because it isn't an R10000.
Please examine the detailed test results
for the use of these CPUs in O2 before forming any conclusions. Also read Byte's article on the
R5000, examine the R10000 performance comparison pages for R10K/195 and R10K/250, the SGI Performance
Comparisons page, etc.
Here is a SPEC95 performance summary table (averages only) for R5000 and
R10000 in O2:
                           SPECint95   SPECfp95
R5000PC   180MHz (no L2):     3.70        4.55
R5000SC   180MHz 512K L2:     4.82        5.42
R5000SC   200MHz 1MB L2:      5.40        5.70
R5200SC   300MHz 1MB L2:      8.04        6.86
R10000SC  195MHz 1MB L2:     10.10        8.77
R10000SC  250MHz 1MB L2:     12.10        9.71
R12000SC  300MHz 1MB L2:     14.49       10.42
Some people react with disappointment when first examining the
R10000's/R12000's floating point (fp) performance in O2, compared to other SGI
systems which use R10000 such as Octane and Origin (the discussion here will refer
to R10000, but the same issues apply to R12000 in O2). There are several things
to be said on this subject:
- SPEC95 cannot properly measure the capabilities of O2. If you look at
application performance (eg. AIM), the R10000 does quite well,
especially for integer (int) tasks (O2 can be just as fast as Origin for int
tasks). If you're interested in an R10000 configuration, you must decide
whether the extra cost is worth the performance gain. Make sure you have your
particular task tested before you buy. Operations such as off-screen
rendering will benefit from an R10K, but don't think that O2 is a number
crunching box because it isn't and was never designed to be. If fp number
crunching is your main task, then you should be looking at Origin or
Octane, not O2. Certainly, O2's int performance is greatly improved with an
R10K, easily matching Octane, almost always beating older systems such as
Power Challenge, Indigo2, etc. and often beating an R10K/180MHz Origin200 (in
fact, for m88ksim on SPEC95 running on R10K/250, O2 outperformed an
Origin2000!).
- R10K was never designed for a memory system like UMA. R10K was designed
for a faster memory system than is used in O2, such as is used in the
Octane/Origin/Onyx2 line; O2's memory system runs at a lower clock speed and
has higher memory latency properties than Origin.
- CRM contains memory control circuitry for the R5K, but not for R10K. To
compensate for this, R10K O2 systems have an extra ASIC on the R10K
daughterboard to handle L2 cache requests. CRM was designed for 32byte cache
refills, whereas R10K is designed for 64byte or 128byte refills. The extra
ASIC converts R10K cache refill requests into multiple 32byte requests. This
means extra time spent during cache misses. As a result, SPECfp95 results on
O2 with R10K are much lower than R10K in other modern SGI systems, even though
real world application performance can be 60% better than an R5K at a higher
clock (SPEC95 punishes cache misses quite heavily). Also, R10K in O2 can only
offer 1 outstanding cache miss, compared to 4 in Octane/Origin/Onyx2;
cache-sensitive code will suffer because of this. Examine the R10K/195 performance
comparison analysis for complete details.
- R10K does not help much for 3D graphics tasks that do not involve 64bit
processing (eg. Gouraud shading). This is because most 3D graphics tasks
require only single precision floating point (fp) computation (this especially
applies to lighting and geometry). Thus, a 180MHz R5K will be faster
than a 150MHz R10K for non-textured 3D graphics tasks. But at an equal clock
speed, R10K will be about 25% faster than R5K for certain graphics tasks.
Performance figures at a similar clock are higher for R10K partly because 3D
graphics does involve a degree of int processing (eg. pointer chasing, array
handling) and such int tasks are faster with R10K.
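The cache refill conversion mentioned in the list above is easy to put into numbers. The sketch below simply counts how many 32byte CRM requests one R10K L2 refill turns into; the function name is hypothetical and it says nothing about the actual timing of each request, which I have no data for.

```python
CRM_REFILL_BYTES = 32  # refill size CRM was designed for, per the text above


def crm_requests_per_refill(line_bytes, crm_bytes=CRM_REFILL_BYTES):
    """Number of CRM-sized requests needed to satisfy one CPU cache refill."""
    if line_bytes % crm_bytes != 0:
        raise ValueError("refill size must be a multiple of the CRM request size")
    return line_bytes // crm_bytes


for line_bytes in (32, 64, 128):
    n = crm_requests_per_refill(line_bytes)
    print(f"{line_bytes:3d}byte refill -> {n} x 32byte CRM request(s)")
```

So every R10K cache miss costs two or four trips through CRM instead of one, and with only 1 outstanding miss allowed, those trips cannot be overlapped with other misses.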
So how relevant is SPEC? A good example for me is that one of
the int tests is JPEG compression. In O2, this operation is done in real-time by
dedicated hardware (ICE), so the SPEC result
is of little value. Also, as far as the R5K is concerned, many SPEC tests employ
double precision computation - something the R5K is not optimised for. The R5K
does not have any special int optimisations. For R10K in O2, the int performance
is very good, so O2 could easily act as a low-cost web server - indeed, I know
of several institutions which use O2 for just this function.
Also, as far as I know, none of the SPECfp95 tests use the kind of single
precision fp calculations that R5K was specifically designed for, namely
MADD-style computation (the matrix math found in 3D graphics). See the Byte article for more
information.
One must decide carefully whether the improved performance offered by an R10K
is worth the extra cost, though R10K O2 systems have become considerably cheaper
in the last year. Either way, always have your application tested before making
any purchasing decision. It must be said though, R10K/R12K systems are
definitely good for integer tasks and 2D work.
There are several additional aspects of the main CPU in O2 that are
worthy of discussion.
The Impact of Screen Resolution on CPU Performance
Lower screen resolutions and shallower colour visuals will allow O2 to run
some kinds of application faster. For example, on an R5000SC/200MHz O2, changing
the screen resolution from 1280x1024 32+32 down to VGA16 improves the STREAM memory benchmark by 13
percent!
This effect may sound bizarre, but the explanation is quite simple and
correlates correctly with the O2's UMA design (refer back to the architecture
diagrams given earlier for clarification).
The CRM ASIC handles data transfers between itself and the:
- main CPU,
- Image Compression Engine (ICE),
- Display Engine (DE),
- and I/O Engine (IOE).
For most users, the vast majority of the available bandwidth from CRM will be
used by the DE. A typical 32bit 1280x1024 72Hz display requires a bandwidth of
360MB/sec. While this data transfer is going on, the main CPU and other system
components must utilise the remaining bandwidth. Thus, if one decreases the
display complexity, there will be less data moving from CRM to DE and hence more
opportunities for CRM to service the rest of the system. As a result,
memory-intensive applications will speed up, and tasks such as video I/O won't
have to compete to the same degree for bandwidth resources. There is always
enough bandwidth to handle video I/O at real-time rates; the difference is that
the IOE will be more likely to be able to transfer data when first requested
(quicker response, better reliability, fewer conflicts with other data being
moved around, lower possibility of error, etc.)
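The 360MB/sec figure quoted above can be reproduced with simple arithmetic. This is a minimal sketch which assumes that 'MB' means 2^20 bytes and that only visible pixels are fetched (no allowance for blanking intervals or any extra buffers); the function name is my own.

```python
def display_bandwidth_bytes_per_sec(width, height, bits_per_pixel, refresh_hz):
    """Bytes per second the Display Engine must fetch to refresh the screen."""
    bytes_per_pixel = bits_per_pixel // 8
    return width * height * bytes_per_pixel * refresh_hz


MIB = 2 ** 20  # assumption: 'MB' in the text means 2^20 bytes

full = display_bandwidth_bytes_per_sec(1280, 1024, 32, 72)
small = display_bandwidth_bytes_per_sec(640, 480, 16, 60)

print(f"1280x1024 32bit 72Hz: {full / MIB:.1f} MB/sec")
print(f"640x480 16bit 60Hz:   {small / MIB:.1f} MB/sec")
```

Under these assumptions the low-end display mode needs roughly a tenth of the refresh bandwidth, which is the bandwidth that becomes available to the rest of the system.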
In fact, for someone whose main task is video I/O processing where the actual
on-screen display isn't important (ie. only the video I/O signals matter, and
perhaps parallel, serial, SCSI transfer too), it would be advantageous to be
able to shut down the CRM/DE data transfer completely, allowing other system
components to make full use of the available bandwidth, especially the main CPU
(eg. offscreen rendering would be quicker). I am currently investigating how
this can be achieved. It is likely that using a VT terminal connected to the
serial port, instead of using the main monitor, would be one way of achieving
this, but at the moment I have no practical data to prove this. When I work out
how to use an O2 via a VT terminal and can obtain a suitable VT, I'll run some
tests.
Who cares? Why does it matter? Well, consider that a 13% performance increase
is, on average, better than the performance increase obtained by
upgrading from a 180MHz R5000SC CPU to a 200MHz R5000SC CPU. For some users,
it could mean the difference between a task taking 45 minutes instead of 60
minutes. If one has many such tasks to run, that extra saving could be very
useful to those with time constraints; for medical people, it could be a
life-saver.
Observe how the STREAM bandwidth figures in the following table (MB/sec),
for a 200MHz R5000SC O2 running IRIX 6.3, gradually improve (ie. increase) as
the display complexity decreases:
Display Complexity     Copy    Scale   Add     Triad
1280-1024-32-32-75     69.2    69.1    69.6    70.7
1280-1024-32-32-60     72.1    71.5    72.8    73.8
1280-1024-32-32-50     74.5    73.5    72.7    73.7
1280-1024-32-32-48     75.4    74.5    74.1    75.1
1280-1024-16-16-75     74.1    73.1    73.4    74.9
1280-1024-16-0-75      76.6    75.5    74.3    76.2
1024-768-32-32-75      75.6    75.1    75.1    75.9
1024-768-32-32-60      76.8    76.2    75.8    77.0
1024-768-32-0-60       77.1    76.1    76.1    77.0
1024-768-16-0-60       80.8    78.9    79.3    80.3
800-600-32-32-72       79.5    78.1    76.5    77.9
800-600-32-32-60       80.5    78.6    78.7    80.0
640-480-32-32-60       82.5    80.2    80.8    82.0
640-480-16-16-60       82.7    81.3    82.1    83.0
640-480-16-0-60        83.2    81.2    82.2    83.2
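To turn the STREAM figures above into percentage gains, here is a short sketch. It copies three rows from the results and compares the mean of the four STREAM kernels; averaging the kernels is my own choice for the illustration, not part of the STREAM benchmark itself.

```python
# Three rows copied from the STREAM results above: (Copy, Scale, Add, Triad) in MB/sec.
STREAM_MB_PER_SEC = {
    "1280-1024-32-32-75": (69.2, 69.1, 69.6, 70.7),
    "1280-1024-32-32-60": (72.1, 71.5, 72.8, 73.8),
    "640-480-16-0-60": (83.2, 81.2, 82.2, 83.2),
}


def mean_bandwidth(mode):
    """Mean of the four STREAM kernels for one display mode."""
    values = STREAM_MB_PER_SEC[mode]
    return sum(values) / len(values)


def percent_gain(slow_mode, fast_mode):
    """Percentage improvement in mean bandwidth going from slow_mode to fast_mode."""
    return (mean_bandwidth(fast_mode) / mean_bandwidth(slow_mode) - 1) * 100


for baseline in ("1280-1024-32-32-75", "1280-1024-32-32-60"):
    gain = percent_gain(baseline, "640-480-16-0-60")
    print(f"{baseline} -> 640-480-16-0-60: {gain:.1f}% faster")
```

Measured this way, the gain from the 60Hz 1280x1024 mode is a little over 13 percent, and the gain from the 75Hz mode is larger still, so the refresh rate matters as well as resolution and depth.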
I don't yet have any data for STREAM running on O2 when the display is a VT
terminal. I would be most interested to hear from anyone who has run such a
test, or from someone who has any idea how the CRM/DE data transfer could be
shut down under software control. Whoever figures out how to do this in a
reliable manner will help thousands of users.
Over the next few weeks, I will be running other tests to see if this effect
can be seen for everyday tasks such as 3D modeling, movie conversion, etc. Some
tasks may be I/O-disk bound, in which case the display complexity will be
irrelevant; other tasks may be compute bound - lowering the display to VGA16
could give a good speed increase. Note: forcing the monitor to go into power
saving mode does not shut down the CRM/DE data transfer.
CPU perf
Geometry/Lighting and Comparing to Hardware Accelerated Systems
How can an R4600PC 100MHz Indy XL outperform an R4400SC 250MHz Indigo2
Elan for a 3D graphics task? Answer: when the 3D scene includes complex geometry
and lighting calculations.
O2 does all geometry and lighting calculations in the main CPU. The same is
true for Indy XL, Indigo2 XL, or any similar system such as Crimson Entry. At
first this may sound like a disadvantage, but as main CPUs have improved in
power, we have now entered an era where older systems with good main CPUs and
no hardware graphics acceleration can easily outperform older systems
with old types of hardware accelerator board (XS24, XZ, Elan and Extreme). When
this situation occurs, it doesn't really matter what type of CPU is present in
the system that has the hardware acceleration. The key point is that the main
CPU in the former system has an effective fp performance that is better than the
Geometry Engines (GEs) on the latter system's accelerator board.
The original XZ graphics offered 64MFLOPS of GE power; later revisions (seen
as Elan by hinv on Indigo2, and XZ on Indy) offered 128MFLOPS, and Extreme
offered 256MFLOPS. The R5000 CPU in Indy offered between 300MFLOPS and 360MFLOPS
peak single-precision MADD performance.
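The CPU figures above follow from counting each MADD as 2 fp operations per clock. The sketch below assumes that rule and assumes 150MHz and 180MHz as the R5000 Indy clock speeds (which is what 300 and 360 MFLOPS imply); the GE figures are the ones quoted in the text.

```python
def peak_madd_mflops(clock_mhz, flops_per_clock=2):
    """Peak single-precision rate if one MADD (multiply + add) issues every clock."""
    return clock_mhz * flops_per_clock


# Geometry Engine ratings as quoted in the text above (MFLOPS).
GE_MFLOPS = {"original XZ": 64, "Elan / later XZ": 128, "Extreme": 256}

for clock_mhz in (150, 180):  # assumed R5000 Indy clock speeds
    cpu = peak_madd_mflops(clock_mhz)
    for board, ge in GE_MFLOPS.items():
        print(f"R5000 {clock_mhz}MHz: {cpu} MFLOPS peak = {cpu / ge:.1f}x {board} ({ge})")
```

These are peak numbers only; real code will not issue a MADD on every clock, so treat the ratios as an upper bound on the CPU's advantage.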
Complex lighting calculations can hit these older accelerator boards
(XZ/Elan/Extreme) hard. All older SGIs with hardware acceleration only support
one hardware light (compare to InfiniteReality which supports four), so when
multiple lights are present, the calculations become too complex, context
switches occur because temporary data must be stored somewhere, the graphics
board FIFOs fill up because the main CPU is sending in data faster than the
board can process it, the CPU has to pause constantly to wait for the FIFOs to
drain, and thus the GEs become the main bottleneck. In such situations, the main
CPU may be little used - I saw only 2% CPU usage when running such a scenario on
my Indigo2 Elan.
On the other hand, systems like XL offload all such calculations onto the
main CPU. When things get tough, the main CPU runs as fast as it can just as
always, hence the situation with FIFOs filling up, context switches occurring,
etc. never happens and the system is ironically able to give a fair performance.
That is how an Indy XL can outperform an Indigo2 Elan. It is also the reason why
an R5000 Indy XZ can be slower than an R5000 Indy XL (the former must do
its geometry/lighting calculations on the XZ board, completely wasting the much
higher fp power of the main CPU).
What relevance is this to O2? Well, as CPUs improve in power, it is becoming
very clear that O2 is almost certainly going to significantly outperform even a
MaxIMPACT in the long term for certain types of task - it'll definitely be able
to outperform a HighIMPACT or SolidIMPACT anyway. The reason is
geometry/lighting: as the main CPUs for O2 improve, the single-precision fp
performance will eventually exceed that offered by the GEs of systems like
SolidIMPACT (480MFLOPS), High IMPACT (480MFLOPS) and Max IMPACT (960MFLOPS). A
300MHz R12000 would offer 600MFLOPS, so I would expect an R12K/300 O2 to
outperform a SolidIMPACT Indigo2 for tasks involving complex geometry and
lighting (especially something like multiple spotlights). Casting one's
imagination ahead, I would fully expect something like (say) a 500MHz R14K O2 to
thoroughly outperform an Octane/SI (at least 1GFLOP for the O2 compared to half
that for the Octane/SI's GEs).
These effects will be important unless the bottleneck becomes
something else such as:
- pixel fill,
- texture bandwidth,
- memory bandwidth/latency,
- etc.
These could be important if, for example, the 3D scene contained a very large
number of polygons, or a complex dynamic scene. My comments above mainly refer
to scenes that involve multiple lights and low polygon counts, eg. VRML worlds.
For a more thorough investigation and discussion of these issues, please see
my HolliDance
Benchmark page, which includes a table of example performance results for a
typical dynamic 3D real-time scene that contains complex lighting. If you own or
have access to an SGI, please consider submitting a set of results as I am
convinced that, for relevant tasks, the HolliDance Benchmark results table will
be a very useful resource to 2nd-hand buyers and those considering upgrades from
older systems. It should also be useful to users of faster systems who may be
interested in possible performance degradation when the number of lights reaches
a certain threshold (see the benchmark page for more details).
When thinking about O2, these issues may be important if your task involves
real-time 3D animation, VRML, low-end visual-simulation, etc. It could be
especially relevant if you have an older system, are considering an upgrade, and
aren't sure whether to go for something like an Indigo2 Extreme/IMPACT, an O2,
or an entry-level Octane. It's quite surprising to think that O2 could gradually
be seen to outperform many existing SGIs for tasks that involve complex
lighting. However, I doubt this will occur with Onyx2 since IR supports four
hardware lights and the GEs offer 2.56GFLOPS of processing power - much greater
than even a theoretical 800MHz R14K (unless such a future CPU was able to do 4
fp operations per clock instead of 2 fp operations per clock).
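One way to look at the GE figures quoted in this section is to ask what CPU clock would be needed to match them on peak fp rate alone. This is a rough sketch using the same 2 (or 4) fp operations per clock rule; it ignores everything else that affects real performance, and the function name is my own.

```python
def clock_needed_mhz(ge_mflops, flops_per_clock=2):
    """CPU clock at which peak fp rate equals a given Geometry Engine rating."""
    return ge_mflops / flops_per_clock


# GE ratings as quoted in this section (MFLOPS).
GE_MFLOPS = {
    "SolidIMPACT": 480,
    "High IMPACT": 480,
    "Max IMPACT": 960,
    "InfiniteReality": 2560,
}

for name, ge in GE_MFLOPS.items():
    at2 = clock_needed_mhz(ge, 2)
    at4 = clock_needed_mhz(ge, 4)
    print(f"{name:16s} {ge:5d} MFLOPS: {at2:6.0f}MHz at 2 ops/clock, {at4:5.0f}MHz at 4 ops/clock")
```

On these numbers an 800MHz CPU at 2 operations per clock (1600 MFLOPS peak) would still fall short of InfiniteReality, whereas at 4 operations per clock it would not.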
With hindsight, and certainly for particular types of O2 user (eg. anyone
doing VRML), the fact that O2 does all geometry/lighting calculations in
software could prove very advantageous in terms of much greater performance in
the future. Note that this kind of task is very different from the
typical 'primitive' level benchmarks shown on technical reports and PR web
sites. Such simplistic performance figures (eg. flat tris/sec, or lit, shaded,
textured triangles/sec) almost always involve either no lighting whatsoever, or
just a single directional light, thus hardware acceleration boards never
experience the problem of having to deal with more light sources than can be
handled by the hardware at one time. A good example of this is that although an
R4600PC 100MHz Indy XL outperforms an R4400SC 250MHz Indigo2 Elan for the
HolliDance 3D animation program by 8 percent (large window, no texture), if one
turns all the lights off then the Indigo2 immediately becomes 158% quicker than
the Indy.
What I've tried to highlight here is that you should be very wary of
assuming O2 must be better or worse than older systems simply because
it's newer, has a better main CPU, etc. The reality may be much more complex
because of the way graphics hardware works and how the different components of a
system interact, coupled with the fact that different systems often work in very
different ways.
For example, one might assume that an O2 should outperform an Indigo2
Extreme for Gouraud shaded tasks, and indeed it does on the primitive level
benchmarks by a moderate to reasonable margin (between 7 and 65 percent for
various CPUs); but what might be a surprise to many is that O2 can completely
stomp over an Indigo2 Extreme for a 3D task that involves multiple lights. For
the HolliDance benchmark, compared to R4400SC/250MHz Indigo2 Elan, the O2 was
510 percent faster! The primitive level benchmarks would have suggested a
difference of around 170%.
But turn off the lights and the difference changes drastically: O2 is now
144% faster than Indigo2 Elan, a figure which correlates much better with the
primitives tests. In other words, when the complex lighting is turned off, both
systems speed up, but Indigo2 Elan speeds up by a much greater degree (300%
compared to 60%) because all the horrible bottlenecks concerning the GEs are
removed, though it's still slower overall. Obviously, I would expect the
differences between O2 and Indigo2 Extreme for HolliDance to be less, but I
reckon O2 would still be at least 200% quicker when the lights are turned on (as
opposed to the 20% difference one might expect from the primitives tests).
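The percentages quoted for O2 versus Indigo2 Elan can be cross-checked against each other. The sketch below converts "N% faster" into a speed ratio and shows that the lights-on figure, combined with each system's own speed-up when the lights go off, gives the lights-off figure; the function names are my own.

```python
def faster_pct_to_ratio(pct):
    """'N% faster' expressed as a speed ratio, eg. 100% faster = 2.0x."""
    return 1 + pct / 100


def ratio_to_faster_pct(ratio):
    """Inverse of faster_pct_to_ratio."""
    return (ratio - 1) * 100


o2_vs_elan_lit = faster_pct_to_ratio(510)   # O2 vs Indigo2 Elan, lights on
o2_speedup = faster_pct_to_ratio(60)        # O2 when lights are turned off
elan_speedup = faster_pct_to_ratio(300)     # Indigo2 Elan when lights are turned off

o2_vs_elan_unlit = o2_vs_elan_lit * o2_speedup / elan_speedup

print(f"lights on:  O2 is {o2_vs_elan_lit:.2f}x Indigo2 Elan")
print(f"lights off: O2 is {o2_vs_elan_unlit:.2f}x Indigo2 Elan "
      f"({ratio_to_faster_pct(o2_vs_elan_unlit):.0f}% faster)")
```

The three figures agree: a 6.1x lead shrinks to 2.44x once the Elan's GE bottleneck is removed.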
3D graphics is a strange thing. Yet again, this is more proof, if any were
needed, that the only benchmark test one should really trust when making a
purchasing decision is one's own application.
ICE
ICE consists of two parts: a 66MHz 64bit R4K-derived control logic unit plus a
66MHz SIMD 128bit MDMX-style central processing unit. The SIMD core can do
sixteen 8bit MACs or eight 16bit MACs per clock. Each MAC is 2 operations
(multiply + add), so 66M * 2 * 16 = roughly 2 billion operations/sec. The MAC
figure is therefore about 1 billion MACs/sec for 8bit integer ops and about
500 million MACs/sec for 16bit integer ops (ICE cannot be used for fp
computation). The unit as a whole is designed to handle multiple data
streams.
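The throughput arithmetic above can be written out as follows. It is a minimal sketch that assumes the 128bit SIMD unit is fully packed with operands of the given width and that one packed MAC instruction issues every clock; the names are my own.

```python
ICE_CLOCK_HZ = 66_000_000   # 66MHz, as stated above
SIMD_WIDTH_BITS = 128       # width of the SIMD unit, as stated above


def macs_per_clock(operand_bits):
    """How many operands of this width fit in one 128bit SIMD register."""
    return SIMD_WIDTH_BITS // operand_bits


def macs_per_sec(operand_bits):
    return ICE_CLOCK_HZ * macs_per_clock(operand_bits)


def ops_per_sec(operand_bits):
    """Each MAC counts as 2 operations (multiply + add)."""
    return 2 * macs_per_sec(operand_bits)


for bits in (8, 16):
    print(f"{bits:2d}bit: {macs_per_clock(bits):2d} MACs/clock, "
          f"{macs_per_sec(bits) / 1e6:.0f} million MACs/sec, "
          f"{ops_per_sec(bits) / 1e9:.3f} billion ops/sec")
```

The exact peak figures come out slightly above the rounded 1 billion and 500 million quoted in the text because the clock is 66MHz rather than 62.5MHz.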
The controller element is programmable, to allow for future video and image
formats - this means it's likely that the unit is perfectly capable of doing
four 32bit ops or two 64bit ops per clock, but I don't think the current
libraries support such operations since today's video/image tasks don't need
them.
ICE allows one to do some impressive real-time image and video operations,
some of which are shown in the various O2 demo programs. Real-time examples
include: edge detection, colour space conversion, luma and chroma keying, etc.
For a more thorough description of ICE, please see my main ICE page.
Incidentally, because of the many questions about ICE that I've thrown at
people in SGI, a member of SGI's Global Technical Support has begun the process
of writing a proper report on ICE for a future issue of Pipeline.
Finally, here is SGI's own description of the ICE system:
The following table lists some key digital media hardware
specs of the O2:
* Image and Compression Engine (ICE):
  * Built-in motion JPEG video compression/decompression
  * Built-in imaging acceleration
* Video input and output
* Screen-capture video source (graphics screen available as
  video input device)
* Improved digital video camera with built-in microphone and
  shutter button
Silicon Graphics is also releasing IRIX 6.3 for O2. This updated
OS version has the following new elements:
* New digital media buffer (DMbuffer) programming
interface for sharing unified memory among the
application, video I/O devices, compression,
graphics rendering, and graphics display
* New Video Library (VL) programming interface
to DMbuffers
* New digital media image conversion (dmIC)
programming interface based on DMbuffer for
direct data transfer among image-conversion
algorithms/devices, video I/O, and graphics
* Hardware-accelerated OpenGL imaging extensions
Audio and Video I/O Ports
The following I/O devices transfer audio samples and video pixels
into and out of main system memory:
* Camera and camera microphone
* Two line-level analog stereo outputs and
one line-level analog stereo input
* S-video and composite video in/out
* Headphones out
* Microphone in (mono)
* Speaker output
* Optional CCIR 601 digital video adapter in/out
Digital Media Buffer Architecture
The DMbuffer is a new API for programmatic access to a new IRIX
operating system feature that unifies the memory buffering systems
of live video devices, such as video input and output and image
compression and decompression. Also, OpenGL can both read from and
render to the DMbuffer system, thus enabling completely
programmable video effects: anything that you can render to a
window you can also render offscreen and send directly to video
output or compression. Furthermore, video input and decompression
output are available for graphics display.
Image Engine
The software architecture consists of the following elements:
* DMbuffer
* Ability to treat DMbuffer data as pbuffer or
texture map data in OpenGL
* VL receive/send DMbuffers to/from video
I/O hardware
* ICE (Image and Compression Engine) uses
DMbuffers for input and output
* New Digital Media Library (libdmedia)
image conversion API (dmIC)
Image Processing Engine
ICE is a chip, and digital media image conversion (dmIC) is a
software interface. Together, these two components enable video
compression/decompression functions; they also allow applications
to display multiple image streams.
The ICE chip contains the following components:
* MIPS RISC core for program control
* Integer vector unit capable of 8
multiply-accumulates per clock
* Bit stream encoder and decoder
* Intelligent DMA controller
These features are tied together with highly optimized code for
applications such as JPEG encode and decode, general and separable
convolutions, color matrix multiplies, and histogram generation.
Providing the functionality of the Cosmo Compress option card
for Indy, ICE is even more flexible than its predecessor. In
addition to handling single streams of live video, ICE is easily
shared between multiple smaller streams (of any size and rate);
for example: 4 quarter-size, full-rate streams are supported as
easily as 1 full-size, full-rate (or 2 half-size, or 3 third-size,
or 2 full-size, half-rate, and so on). Since there is no built-in
video clock or video dimensionality on the ICE chip, you can also
use it for non-standard sizes and rates; for example, film aspect
ratio at film rate for film animation preview to the graphics
monitor.
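The stream combinations listed above all amount to the same total pixel rate. The sketch below makes that explicit under a simplifying assumption of my own: that the load on ICE scales linearly with (number of streams) x (frame size fraction) x (frame rate fraction), with 1 meaning one full-size, full-rate stream. The real chip will have per-stream overheads that this ignores.

```python
from fractions import Fraction


def ice_load(streams, size_fraction, rate_fraction=Fraction(1)):
    """Total load relative to one full-size, full-rate stream (assumed linear)."""
    return streams * size_fraction * rate_fraction


examples = {
    "1 full-size, full-rate": ice_load(1, Fraction(1)),
    "4 quarter-size, full-rate": ice_load(4, Fraction(1, 4)),
    "2 half-size, full-rate": ice_load(2, Fraction(1, 2)),
    "3 third-size, full-rate": ice_load(3, Fraction(1, 3)),
    "2 full-size, half-rate": ice_load(2, Fraction(1), Fraction(1, 2)),
}

for name, load in examples.items():
    print(f"{name}: load = {load}")
```

Fractions are used instead of floats so that the thirds add up exactly.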
With the Indy, all imaging and compression calculations were done
by the main CPU. ICE, which functions as a separate CPU, now
handles these calculations, which frees the main CPU to handle
other processes. Also with the Indy, you had to purchase dedicated
cards, such as a JPEG card, to handle jobs such as compression.
Silicon Graphics designed O2 with flexibility as a key objective.
Consequently, the system can handle JPEG compression as well as
image-processing functions, without having to purchase dedicated
cards for each process.
The IO Engine (IOE) is a chip that brings video and audio into and
out of the system. Both IOE and ICE feature direct memory access
(DMA) controllers, which enable them to read compressed images
and output the information to a video out channel.
Not only do IOE, ICE, and UMA simplify the sharing of digital
media data between subsystems, their interaction is many times
faster than more common methods of transferring data between
subsystems over a system bus.
New Image Conversion API
The Digital Media Library (libdmedia) that's included with IRIX
6.3 features a new digital media image conversion library (dmIC).
You use this low-level API for memory-to-memory image
compression/decompression and conversion.
dmIC supports the standard software image codecs supported by the
older Compression Library (libcl) interface in IRIX 6.2 and
earlier releases. dmIC also supports the real-time motion JPEG
encode/decode capability of the O2 ICE processor:
* The dmIC interface makes software image
codecs and hardware-accelerated memory-to-memory
codecs look the same to application developers.
* dmIC operates on image data stored in DMbuffers.
This makes it possible to share image data
between hardware or software codecs and OpenGL
or the Video Library, without
copying data.
* dmIC does not support in-line compression devices
that are integrated into video capture or playback
hardware paths; for example Cosmo Compress, Impact
Compress. These kinds of devices require a slightly
different programming model from the model used to
send data to and receive data from an asynchronous
memory-to-memory processor. The older libcl
continues to provide the applications
programming interface to these kinds of devices.
* An application can query dmIC to determine
whether the current system offers a real-time
implementation of a particular memory-to-memory
codec; for example, JPEG.
The real-time JPEG codec on O2 supports full-rate
encode/decode at NTSC/PAL square pixel,
CCIR 601/525, and CCIR 601/625 video timings.
On systems that are not equipped with a real-time
memory-to-memory codec, an application can
also use the non-real-time software implementation.
* The Compression Library functionality offered in
IRIX 6.2 will continue to be supported in IRIX 6.3
and future releases in order to ensure backward
compatibility for applications.
* Starting with IRIX 6.3, MPEG audio/video encode
and Cinepak encode capabilities are bundled
with every Silicon Graphics system.
These software encoders no longer require a Silicon
Graphics run-time license.
The new dmIC routines are declared in the public header
. The new DMbuffer routines for creating
and manipulating DMbuffers are declared in .
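The idea of sharing image data between consumers without copying can be illustrated by analogy. The sketch below is not the DMbuffer API: `SharedFrame` is a hypothetical class, and Python's `memoryview` merely stands in for handing the same block of memory to a codec and to a renderer.

```python
class SharedFrame:
    """Hypothetical stand-in for one buffer shared by several consumers."""

    def __init__(self, nbytes):
        # One allocation holds the pixel data for the frame.
        self._storage = bytearray(nbytes)

    def view(self):
        # Each consumer receives a view onto the same memory, not a copy.
        return memoryview(self._storage)


frame = SharedFrame(16)
codec_view = frame.view()      # e.g. what a decoder would write into
display_view = frame.view()    # e.g. what a renderer would read from

codec_view[3] = 0x80           # the "decoder" writes one byte
print(display_view[3])         # the "renderer" sees it without any copy
```

Because both views refer to the same storage, the cost of passing a frame between stages does not depend on the frame size.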
OpenGL Extensions for Image Data
Silicon Graphics created OpenGL extensions for O2, which allow you
to use DMbuffers as either pbuffers or texture maps. The company
also designed an OpenGL extension for rendering YCrCb (4:2:2)
interlaced data, which lets you save video display pixels in a
pixel format, rather than converting them to bits. Using these
extensions, you can also perform hardware color space conversions
from YCrCb to RGB.
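As a rough software picture of the color space conversion mentioned above, the following sketch converts one 8-bit YCrCb sample to RGB. It assumes the full-range coefficients used by JFIF (derived from CCIR 601); the source does not say which coefficients or ranges the O2 hardware uses, and `ycrcb_to_rgb` is a hypothetical name.

```python
def ycrcb_to_rgb(y, cr, cb):
    """Convert one 8-bit YCrCb sample to 8-bit RGB.

    ASSUMPTION: full-range JFIF coefficients, chroma centred on 128.
    """
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)

    def clamp(value):
        # Round to the nearest integer and keep the result in 0..255.
        return max(0, min(255, round(value)))

    return clamp(r), clamp(g), clamp(b)


# Neutral chroma (128, 128) leaves a grey whose level equals Y.
print(ycrcb_to_rgb(128, 128, 128))
```

In 4:2:2 data, two horizontally adjacent pixels share one Cr/Cb pair, so the same chroma values would be passed for both luma samples.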
In addition to the new OpenGL extensions, O2 provides hardware
acceleration for the following existing extensions:
* Color scale and bias
* Color table look-ups
* Convolutions: 3x3, 5x5, and 7x7
(separable and general)
* Color matrix multiply
* Histogram and MinMax
The support of these operations should promote interesting
applications, with real-time feedback (attributable to the
performance increase), in the fields of medical imaging, GIS, and
post production. Moreover, the support of a common API (OpenGL)
enables applications to run across the product line, with
performance gains associated with the platform on which the
applications are running.
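The distinction between separable and general convolutions in the list above can be sketched in software. This is a plain illustration of the technique, not the OpenGL extension itself; both function names are hypothetical, only the "valid" (unpadded) region is computed, and the kernel is applied without flipping, which makes no difference for symmetric kernels.

```python
def convolve2d_valid(image, kernel):
    """General 2D filter: every kernel tap is applied at every pixel."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [[sum(kernel[j][i] * image[y + j][x + i]
                 for j in range(kh) for i in range(kw))
             for x in range(w - kw + 1)]
            for y in range(h - kh + 1)]


def convolve_separable_valid(image, col, row):
    """Separable filter: one horizontal pass, then one vertical pass."""
    # Horizontal pass with the 1D row kernel.
    pass1 = [[sum(row[i] * line[x + i] for i in range(len(row)))
              for x in range(len(line) - len(row) + 1)]
             for line in image]
    # Vertical pass with the 1D column kernel.
    return [[sum(col[j] * pass1[y + j][x] for j in range(len(col)))
             for x in range(len(pass1[0]))]
            for y in range(len(pass1) - len(col) + 1)]


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
col = row = [1, 2, 1]
kernel = [[c * r for r in row] for c in col]   # outer product of col and row

print(convolve2d_valid(image, kernel) == convolve_separable_valid(image, col, row))
```

When a kernel factors into a column vector and a row vector, the two-pass form needs 6 multiplies per pixel for 3x3 instead of 9, and 14 instead of 49 for 7x7.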
DMcolor and OpenGL Color Matrix Extensions
With O2, you can use OpenGL hardware to perform color transforms. In
addition, the DMcolor API can set up transform matrices that the
application can pass to OpenGL. The system also has a software
image color space conversion engine in libdmedia.
Video Library and DMbuffers
The system has new Video Library (VL) calls for receiving video
data (fields or pairs of fields interleaved to form frames) into
DMbuffers, and for sending video data using DMbuffers. In
addition, the video I/O path can handle mipmap generation for live
video. The older VLbuffer interface is still supported as well.
Audio Library Enhancements
Starting with IRIX 6.3, the Audio Library (AL) is packaged as a
DSO rather than as a static library. The Audio Library adds a
number of new functions and features; however, the 6.3 version of
the library is backward-compatible with previous releases.
New features in 6.3 include:
* The ability to support multiple audio I/O
devices in a single system.
* Support for the O2 workstation's ability to
lock audio and video sample rates together in
hardware to prevent drift during synchronized
audio/video recording or playback.
In addition, IRIX 6.3 introduces a new, generalized version of the
Audio Control Panel, which can automatically configure itself when
you add audio I/O devices to the system.
High-Resolution Timer for Synchronizing Audio and Video Streams
The O2 workstation includes audio/video hardware support for
Silicon Graphics' high-resolution digital media timer, the
unadjusted system time (UST) clock.
UST provides a common time base for timestamping audio samples and
video fields as they enter or leave the system through the
audio/video I/O ports. AL and VL each support timestamps based on
the UST clock. Applications can use this common timebase to
correlate and synchronize audio and video input/output
streams. Refer to the man pages alGetFrameTime(3dm) and
vlGetUSTMSCPair(3dm) for more information.
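The way a common time base lets an application line up audio against video can be sketched as simple arithmetic. The sketch assumes each library can report a (counter, UST) pair for its stream and that UST is expressed in nanoseconds; the function names and tuple layout are hypothetical and do not reproduce the AL or VL calls.

```python
from fractions import Fraction

NANOS_PER_SECOND = 1_000_000_000  # ASSUMPTION: UST counts nanoseconds


def ust_at(ref_count, ref_ust, rate_hz, count):
    """Extrapolate the UST at which item number `count` crosses the port.

    (ref_count, ref_ust) is one known counter/timestamp pair and
    rate_hz is the stream rate (audio frames or video fields per second).
    """
    return ref_ust + Fraction(count - ref_count) * NANOS_PER_SECOND / Fraction(rate_hz)


def audio_frame_for_field(audio_ref, audio_rate, video_ref, field_rate, field):
    """Find the audio frame that coincides with a given video field."""
    field_ust = ust_at(video_ref[0], video_ref[1], field_rate, field)
    elapsed = field_ust - audio_ref[1]
    return audio_ref[0] + round(elapsed * Fraction(audio_rate) / NANOS_PER_SECOND)


# PAL example: 50 fields after the reference is one second of 48 kHz audio.
audio_ref = (1000, 5_000_000_000)   # (audio frame number, UST)
video_ref = (100, 5_000_000_000)    # (video field number, UST)
print(audio_frame_for_field(audio_ref, 48000, video_ref, 50, 150))
```

Rates are handled as exact fractions so that a non-integer field rate such as 60000/1001 does not accumulate rounding error over long recordings.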
The O2 architecture makes the high-resolution UST clock visible to
PCI option cards as well as to the audio/video subsystems that are
standard on the system.
Movie Library Enhancements
Starting with IRIX 6.3, the Movie Library is packaged as a pair of
DSOs rather than as a single static library. The Movie Library API
is backward-compatible with previous IRIX releases. The two DSOs are:
* Movie file library (libmoviefile.so) deals with
movie file reading, writing and editing. This
DSO includes the functions defined in the public
header .
* Movie playback library (libmovieplay.so) provides
high-level functions for movie playback with
synchronized sound and images. This DSO includes
the functions defined in the public header
.
The IRIX 6.3 version of the Movie Library offers the following new
features:
* Support for Indeo encoding and writing AVI files
* Support for creating MPEG-1 video and
systems bitstreams through the movie file
library interface
* Support for full-rate, full-resolution motion
JPEG playback with synchronized audio by using
the real-time JPEG decode capabilities of the
O2 ICE processor
* Ability to take advantage of the OpenGL
extensions for rendering interlaced image
data and YCrCb image data on O2
New Audio Conversion API
The Digital Media Library (libdmedia) that's included with IRIX 6.3
features a new digital media audio conversion library (dmAC). You
use this low-level API for memory-to-memory audio sample format
conversion, sample rate conversion, and compression/decompression.
dmAC supports these audio conversion operations:
* Sample data format conversion
(signed, unsigned, float, double, scaling)
* Sample rate conversion
(several algorithms)
* Channel conversion
(mono, stereo, 4-channel, and so on)
* Compression/decompression
IRIX 6.3 supports the following audio compression algorithms:
* CCITT G.711 mu-law and A-law
* CCITT G.722
* CCITT G.726 16, 24, 32, and 40 Kb/sec
* CCITT G.728
* GSM
* Intel DVI ADPCM
* MPEG audio
All of the audio compression/decompression and conversion
algorithms are implemented in software. No special option hardware
is required to perform these conversions.
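To give a feel for what a software implementation of one of these algorithms involves, here is a minimal sketch of G.711 mu-law encoding and decoding for 16-bit samples. It follows the widely used bias-and-segment formulation and is not taken from libdmedia; the function names are hypothetical.

```python
BIAS = 0x84    # 132, added before the segment (exponent) is located
CLIP = 32635   # largest magnitude that can be encoded after biasing


def linear_to_ulaw(sample):
    """Encode one signed 16-bit sample as an 8-bit mu-law byte."""
    sign = 0x80 if sample < 0 else 0
    magnitude = min(abs(sample), CLIP) + BIAS
    # Locate the segment: the position of the highest set bit, 7 down to 0.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # Mu-law bytes are stored with all bits inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def ulaw_to_linear(byte):
    """Decode one 8-bit mu-law byte to a signed 16-bit sample."""
    byte = ~byte & 0xFF
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if byte & 0x80 else magnitude


print(hex(linear_to_ulaw(0)))   # silence
print(ulaw_to_linear(0x80))     # largest positive decoded value
```

The encoding is lossy: each byte covers a range of input values, with finer steps near zero and coarser steps for loud samples.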
The new dmAC routines are declared in the public header
.
Starting with IRIX 6.3, MPEG audio encoding is bundled with all
systems and no longer requires a license from Silicon Graphics.
Audio File Library Enhancements
The new version of the Audio File Library (libaudiofile) included
in IRIX 6.3 offers support for several additional sound file
formats:
* Amiga IFF/8SVX
* SampleVision
* Audio Visual Research
* Creative Labs VOC
* Creative Labs SoundFont2
The library now offers transparent sample rate conversion in addition
to transparent sample format conversion and
compression/decompression. You can specify a virtual sample rate from
within your application; for example, 48 kHz. The application can
open sound files that contain data sampled at a variety of rates, and
the library automatically converts between the sample rates used in
the sound files (such as 44.1 kHz, 32 kHz, or 16 kHz) and the virtual
sample rate that the application requests.
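The effect of a virtual sample rate can be pictured with a small resampler. This sketch uses linear interpolation, which is chosen here only for brevity; the source does not say which algorithm libaudiofile uses, and `resample_linear` is a hypothetical name.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono sample list from src_rate to dst_rate (both in Hz).

    ASSUMPTION: plain linear interpolation. No low-pass filter is applied,
    so downsampling real audio this way would alias.
    """
    if not samples:
        return []
    n_out = len(samples) * dst_rate // src_rate
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # position in the source stream
        i0 = min(int(pos), last)
        i1 = min(i0 + 1, last)             # hold the final sample at the end
        frac = pos - i0
        out.append(samples[i0] * (1 - frac) + samples[i1] * frac)
    return out


# 10 ms of 44.1 kHz audio becomes 10 ms at a 48 kHz virtual rate.
print(len(resample_linear([0] * 441, 44100, 48000)))
```

An application written against one fixed rate could apply a conversion of this kind to every file it opens, which is the convenience the library provides transparently.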
mirror v.0.1.8
O2 tech page
all contents by Ian Mapleson BSc
design ©1997 - 2002 Fsck.it. Last updated 20020503.
SGI, O2, UMA, NUMA, PUMA, and probably some other terms are trademarks
of SGI (Silicon Graphics, Inc.).