Saturday, 28 February 2009

Voxelstein

Here is a scene from Voxelstein.
Surprisingly, the rendering is very quick :-)

Download Demo (Cuda 1.1)

It's an alpha version,
so don't expect anything ;-)

Friday, 27 February 2009

First Color for the new Version

Today the scene got a bit more colorful.

Benchmarks so far:
1024x1024, 1024 rays : 40ms / 25 fps avg.
1024x768 , 1024 rays : 39ms / 25 fps avg.
1024x768 , 512 rays : 36ms / 27 fps avg.
1024x768 , 256 rays : 36ms / 27 fps avg.
512x512 , 512 rays : 21ms / 47 fps avg.
512x512 , 256 rays : 21ms / 47 fps avg.
512x512 , 128 rays : 21ms / 47 fps avg.

So far I couldn't figure out why fewer rays do not increase the frame rate significantly - the computation cost should be proportional to the number of rays.
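One way to read the benchmark numbers is as a fixed per-frame cost plus a cost per ray. The sketch below fits that two-term model through two measurements. The model itself and the names `FrameCost`, `fit_frame_cost` and `fps_from_ms` are assumptions made for illustration; nothing here was measured inside the renderer.

```cpp
#include <cassert>

// Hypothetical model: frame_ms = fixed_ms + per_ray_ms * rays.
struct FrameCost {
    double fixed_ms;    // cost that does not depend on the ray count
    double per_ray_ms;  // additional cost of one ray
};

// Fit the model through two measurements (rays_a, ms_a) and (rays_b, ms_b).
FrameCost fit_frame_cost(double rays_a, double ms_a, double rays_b, double ms_b) {
    assert(rays_a != rays_b);  // two distinct ray counts are needed
    double per_ray = (ms_a - ms_b) / (rays_a - rays_b);
    double fixed = ms_a - per_ray * rays_a;
    return FrameCost{fixed, per_ray};
}

// Frames per second for a given frame time in milliseconds.
double fps_from_ms(double frame_ms) {
    assert(frame_ms > 0.0);
    return 1000.0 / frame_ms;
}
```

Fitted to the 1024-ray and 256-ray rows at 1024x768 (39 ms and 36 ms), this gives a fixed cost of about 35 ms and roughly 0.004 ms per ray. Under this model, something other than the ray count would dominate the frame time, but that is only one possible reading of the data.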

Thursday, 26 February 2009

Texture Mapping Works!

Here is a current screenshot
25 fps@1024x768 :-)
View distance: 40,000 voxels
(with mip-mapping)

Wednesday, 25 February 2009

New Download

Here you can download the demos shown below.
It's still the old version, so rendering at 1024x768 is not fast yet.

Required: Cuda 1.1
Filesize: 22MB

VoxelDemos.zip

Tuesday, 24 February 2009

Progress

Mapping seems to work now - next step is to compute the rays accurately.

Thursday, 19 February 2009

Performance

Today I measured the performance of the current implementation. The result: the scene on the right has about 8.0M RLE elements in the view frustum, of which 4.7M are not culled and 280k are visible, rendered as 450k pixels. At a frame rate of about 25 fps, this means the renderer processes about 117M RLE elements per second (4.7M x 25).

My graphics card's maximum untextured triangle throughput is 280M/s if the triangles share vertices, and about 133M/s if the triangles have independent vertices. The maximum vertex transform rate is about 400M/s.

This means that if the landscape were visualized using splats, each rendered as a single triangle, at least 8M triangles would be required. Without any culling, this would lead to a frame rate of about 133/8 = 16 fps. Here, the geometry shader might perhaps be used to accelerate the rendering: it would be possible to send only one vertex, from which the geometry shader generates a quad or triangle.

If we instead visualized each voxel in the landscape using conventional polygons, we would need at least 2 triangles per voxel to create a quad. Taking shared vertices into account, we would have to render at least 16M triangles, resulting in a theoretical frame rate of 280/16 = 17.5 fps.
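The estimates above are simple products and divisions of throughput and element count. Here is a minimal sketch of that arithmetic, using only the figures quoted in this post; the function names are made up for illustration.

```cpp
#include <cassert>

// Elements processed per second (in millions), given the elements touched
// per frame (in millions) and the frame rate.
double throughput_m_per_s(double elements_m_per_frame, double fps) {
    return elements_m_per_frame * fps;
}

// Upper bound on the frame rate if the GPU sustains `rate_m_per_s` million
// primitives per second and one frame needs `primitives_m` million of them.
double max_fps(double rate_m_per_s, double primitives_m) {
    assert(primitives_m > 0.0);
    return rate_m_per_s / primitives_m;
}
```

For example, `throughput_m_per_s(4.7, 25)` gives about 117.5, `max_fps(133, 8)` gives about 16.6 for the splat case, and `max_fps(280, 16)` gives 17.5 for the polygon case.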

Common technologies to render Voxels

1.) Heightmap based Voxel-Terrain. Ref
2.) Voxlap Technology. Ref1 Ref2
3.) Splatting / Rendering voxels as sprites. Ref1 Ref2
4.) Sparse Voxel Octree (SVO) Raycasting. Ref
5.) Mixed Octree / Regular Grid Raycasting. Ref
6.) Mixed Polygon / Voxel (Splatting) Rendering. Ref
7.) Rendering Voxels as Polygons. Ref

Friday, 13 February 2009


It's more difficult than one might expect. Simple texture mapping can't actually do the job, which means a pixel shader is necessary to do the unwrapping.

Still some work


Until the new version is complete, it's necessary to stretch the temporary render buffer correctly using texture mapping. On the left we can see the debug output so far.

The CUDA output is currently 32 bits per pixel: 16 bits depth, 8 bits color and 8 bits normal.
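The 32-bit output format can be pictured as one packed integer per pixel. The sketch below assumes depth in the upper 16 bits, color in bits 8 to 15 and the normal in the lowest 8 bits. The bit order actually used by the renderer is not stated here, and the names `Pixel`, `pack_pixel` and `unpack_pixel` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout (not confirmed by the post):
//   bits 31..16 depth, bits 15..8 color, bits 7..0 normal.
struct Pixel {
    std::uint16_t depth;
    std::uint8_t color;
    std::uint8_t normal;
};

// Pack the three fields into one 32-bit value.
std::uint32_t pack_pixel(std::uint32_t depth, std::uint32_t color, std::uint32_t normal) {
    // Each field must fit into its bit budget.
    assert(depth <= 0xFFFFu && color <= 0xFFu && normal <= 0xFFu);
    return (depth << 16) | (color << 8) | normal;
}

// Split a packed 32-bit value back into its fields.
Pixel unpack_pixel(std::uint32_t v) {
    Pixel p;
    p.depth = static_cast<std::uint16_t>(v >> 16);
    p.color = static_cast<std::uint8_t>((v >> 8) & 0xFFu);
    p.normal = static_cast<std::uint8_t>(v & 0xFFu);
    return p;
}
```

With this layout, a single 32-bit write per pixel carries everything a later shading pass needs, at the price of only 256 distinct color and normal values.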

Thursday, 12 February 2009

More Speed

After adjusting a couple of parameters and doing further optimizations, I now get 20 fps at 1024x768. Update: after some more optimizations, 30 fps seems possible at 1024x768 :-)

Wednesday, 11 February 2009

Bunny Dataset

The Bunny dataset has a very complex surface and runs at only about 7-8 fps (1024x768).

Today: The Bonsai Tree Dataset

Today I played a bit with the bonsai dataset from here. The number of visible trees is about 3000.

Tips for CUDA programming

If you are thinking of writing a CUDA program, here are a few things to keep in mind:

1.) Reduce the number of used registers to run more parallel threads
2.) Reduce the number of memory accesses
3.) Store runtime variables in registers
4.) Do not use local arrays in your code like int a[3]={1,2,3} - better use separate variables such as a0=1;a1=.. etc. if possible.
5.) Write small kernels. If you have one large kernel, try to split it up into multiple small ones - it might be faster because each uses fewer registers.
6.) Use textures to store your data where possible. Texture reads are cached - global memory reads aren't.
7.) Conditional jumps should branch the same way for all threads
8.) Avoid loops that are run by only a minority of threads while the others are idle
9.) Use fast math routines where possible
10.) A complex calculation often is faster than a large lookup table
11.) Writing your own cache manager that uses shared memory for caching might not be an advantage
12.) Try to avoid multiple threads accessing the same memory element (accesses get serialized - also for shared mem)
13.) Try to coalesce global memory accesses.
14.) Try to avoid bank conflicts when reading shared memory
15.) Small lookup tables can be stored in shared mem
16.) Experiment with the number of parallel threads to find the optimum. In case you run out of registers, use --maxrregcount=...

17.) If you can implement your method using GLSL, it might be faster than CUDA. In GLSL you get a lot of calculations for free, such as alpha blending, fog, z-buffer testing and interpolation of variables between pixels, and perhaps better thread handling too. You also don't have to copy the rendered image around as a PBO, and you'll save development time since there is no bluescreen from a bad pointer.
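Tips 1 and 16 both come down to register pressure: the fewer registers a thread uses, the more threads fit on one multiprocessor. The rough estimate below assumes the limits of compute capability 1.1 hardware (8192 registers and at most 768 resident threads per multiprocessor) and ignores block granularity and shared memory usage. Treat it as a sketch, not as an occupancy calculator; the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-multiprocessor limits for compute capability 1.1 devices.
const int kRegistersPerSM = 8192;
const int kMaxThreadsPerSM = 768;
const int kWarpSize = 32;

// Rough upper bound on resident threads per multiprocessor for a kernel
// that uses `regs_per_thread` registers, rounded down to whole warps.
int max_resident_threads(int regs_per_thread) {
    assert(regs_per_thread > 0);
    int by_registers = kRegistersPerSM / regs_per_thread;
    int threads = std::min(by_registers, kMaxThreadsPerSM);
    return (threads / kWarpSize) * kWarpSize;
}
```

Under these assumptions, a kernel using 10 registers per thread can reach the full 768 threads, 16 registers allow 512, and 32 registers only 256. This is why capping the register count or splitting a large kernel can pay off.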

Wednesday, 4 February 2009

First Colorful Screenshots

Here is the first colorful result of what voxels can do when used with a simple L-system.

I tried to add the culling feature used in Voxlap, but unfortunately it didn't give any speedup with CUDA, so I went back to the previous version.

For the demo, Cuda 1.1 is required.

[ Download Link ( ca.10 MB ) ]