8 channel LPCM over HDMI, It's Multi Amp for Everyone

KOON3876 · 2008-08-24 4:11 am

nuhi
http://www.microsoft.com/express/2005/

fb,
9500GT CUDA1.1 is float in / float out.
With a GTX280 it can be double in / double out, but the GTX280 is too noisy (yes, I have one in another PC) to use in an audio PC.
Now I moved to CUDA 2.0.

TAPS: 8192 is not the limit; it can be 16384 without changing the logic.
__device__ __constant__ float coeff_kernel
has a maximum size of 65536 bytes = 16384 floats.
If you need more TAPS, coeff_kernel has to be placed in global memory.

I think 65536 TAPS is more than enough? It gives a 0.67 Hz filter pitch and about 1.49 second delay at 44100 Hz sampling.
But "how many FIR taps can CUDA process?" is an interesting challenge.
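The numbers above can be rechecked with a few lines of C. This is only a sketch of the arithmetic; the function names are made up for illustration and are not part of the posted project.

```c
#include <assert.h>

/* Number of float coefficients that fit in `bytes` of constant memory. */
unsigned max_const_taps(unsigned bytes)
{
    return bytes / (unsigned)sizeof(float);
}

/* Frequency spacing ("pitch") of a FIR with `taps` coefficients at
   sample rate `fs`, in Hz. */
double fir_pitch_hz(unsigned taps, double fs)
{
    assert(taps > 0);
    return fs / (double)taps;
}

/* Time span covered by `taps` samples at sample rate `fs`, in seconds. */
double fir_length_sec(unsigned taps, double fs)
{
    assert(fs > 0.0);
    return (double)taps / fs;
}
```

With 65536 bytes of constant memory, max_const_taps gives 16384; fir_pitch_hz(65536, 44100.0) is roughly 0.673 and fir_length_sec(65536, 44100.0) roughly 1.486.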

I should build another VS2005 project for GTX280 later.

If you were in the US I would have sent you a GeForce 8400!
But postage to AU is $30, the same as the card itself.

KOON3876 · 2008-08-24 6:52 am

Please let me use this thread as a memo; I was mis-implementing delay/taps.

(1) TAP length and delay
Current: only the new incoming buffer is used for FIR processing.
[NNNNNNNN] required buffer length = TAPS length

New: e.g. a 1024-sample frame comes in and the previous 15360 samples are kept.
[PPPPPPPPPPPPPPPN], delay is 1024 samples.
Prev 1024*15 samples + New 1024 samples are used for a 16384-length FIR. [PPPPPPPPPPPPPPPN] is stored and manipulated in host memory. It works as a delay line (and can contain the channel delay logic).
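A minimal host-side sketch of the [PPPPPPPPPPPPPPPN] window, assuming one float buffer per channel. The names FRAME, HISTORY and push_frame are hypothetical; the real project may organize its buffers differently.

```c
#include <assert.h>
#include <string.h>

#define FRAME 1024            /* samples per incoming frame */
#define HISTORY (FRAME * 16)  /* 15 previous frames + 1 new frame = 16384 */

/* Shift the window left by one frame and append the new frame at the end.
   After the call, window[0 .. HISTORY-FRAME-1] holds the 15 previous
   frames (P) and window[HISTORY-FRAME .. HISTORY-1] the new frame (N). */
void push_frame(float window[HISTORY], const float frame[FRAME])
{
    assert(window != NULL && frame != NULL);
    memmove(window, window + FRAME, (HISTORY - FRAME) * sizeof(float));
    memcpy(window + (HISTORY - FRAME), frame, FRAME * sizeof(float));
}
```

A ring buffer would avoid the memmove, but a flat window is easier to hand to the GPU in a single copy.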

(2) FIR Calling Sequence
Current: the kernel is called 8 times per frame, and taps are limited to 16384 by the constant memory size.
for (way 0,1,2,3)
    for (ch L, R)
        copy FIR Coeff to constants
        do FIR<<kernel>>

New: the kernel is called once per frame, with no limit on taps length.
Copy Coeff to Global (once)
While()
    wait Event
    Copy [PPPPPPPPPPPPPPPN] to device
    Execute <<kernel>>(global, global)
    Copy result to Host
Loopend
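As a CPU stand-in for what the single per-frame call computes, here is a plain C reference. It is not the CUDA kernel. The coefficient layout (way-major, then channel) and the names fir_frame and fir_all are assumptions made for this sketch, and it requires that the window holds enough history for every output sample.

```c
#include <assert.h>
#include <stddef.h>

/* One filter: produce `frame` outputs from the newest `frame` samples of
   `window` (length `win_len`) using `taps` coefficients.
   Requires enough history: taps <= win_len - frame + 1. */
void fir_frame(const float *window, size_t win_len, size_t frame,
               const float *coeff, size_t taps, float *out)
{
    assert(frame <= win_len && taps >= 1 && taps <= win_len - frame + 1);
    size_t base = win_len - frame;       /* index of the first new sample */
    for (size_t n = 0; n < frame; n++) {
        double acc = 0.0;                /* accumulate in double on the CPU */
        for (size_t k = 0; k < taps; k++)
            acc += (double)coeff[k] * (double)window[base + n - k];
        out[n] = (float)acc;
    }
}

/* One pass per frame over every way and both channels.
   Assumed layout: coeff[(way * 2 + ch) * taps + k],
                   out[(way * 2 + ch) * frame + n]. */
void fir_all(const float *win_l, const float *win_r, size_t win_len,
             size_t frame, const float *coeff, size_t taps, size_t ways,
             float *out)
{
    const float *window[2] = { win_l, win_r };
    for (size_t way = 0; way < ways; way++)
        for (size_t ch = 0; ch < 2; ch++) {
            size_t f = way * 2 + ch;
            fir_frame(window[ch], win_len, frame,
                      coeff + f * taps, taps, out + f * frame);
        }
}
```

A reference like this is mainly useful for comparing GPU output against known-good values on short test signals.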

(3) Input stream status
Currently, on paInputUnderflow, paAbort is returned and the stream stops.
To do: add an input stream check and pass 00000000 (silence) to the GPU instead.
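The input check could look roughly like this. The flag value matches PortAudio's paInputUnderflow (0x00000001) but is redefined locally so the sketch compiles without portaudio.h; fetch_input is a hypothetical helper, not a PortAudio function.

```c
#include <assert.h>
#include <string.h>

/* Same bit value as PortAudio's paInputUnderflow; redefined here only
   so the sketch is self-contained. */
#define INPUT_UNDERFLOW_FLAG 0x00000001UL

/* Copy the input frame into `dst`, or fill `dst` with silence when the
   callback status reports an input underflow or there is no input buffer.
   Returns 1 if real input was used, 0 if silence was substituted. */
int fetch_input(float *dst, const float *input, size_t samples,
                unsigned long status_flags)
{
    assert(dst != NULL);
    if (input == NULL || (status_flags & INPUT_UNDERFLOW_FLAG)) {
        memset(dst, 0, samples * sizeof(float));
        return 0;
    }
    memcpy(dst, input, samples * sizeof(float));
    return 1;
}
```

Inside the stream callback this would be called with the statusFlags argument, so the stream keeps running and the GPU simply receives one frame of zeros.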

fb · 2008-08-24 1:20 pm

KOON3876 said:
fb,
9500GT CUDA1.1 is float in / float out.
With a GTX280 it can be double in / double out, but the GTX280 is too noisy (yes, I have one in another PC) to use in an audio PC.
Now I moved to CUDA 2.0.

If you were in the US I would have sent you a GeForce 8400!
But postage to AU is $30, the same as the card itself.

Thanks

I'll buy a card, just wanting to be somewhat future proof

I'm not good at understanding the maths side of it.... can 9500gt do 24bit?

nuhi · 2008-08-24 1:25 pm

fb, aha so just change compiler path to VS8 and use VS9, good.

Koon, what kind of answer is that, did you even read what I asked...this is the second time you give me a half-a$$ed answer and then I need to cool down, yeah right.

KOON3876 · 2008-08-24 6:09 pm

fb, about the accuracy of float: does the output below show what you want?

/* X1 is a 24-bit sample value (int); 65536 * 128 = 2^23 */
f2 = (float)X1 / (float)(65536 * 128);
printf("Int %d\n", X1);
printf("float %2.16f\n", f2);
printf("converted back to int %d\n", (int)(f2 * (65536 * 128)));

Int 1
float 0.0000001192092896
converted back to int 1
Int 16
float 0.0000019073486328
converted back to int 16
Int 151
float 0.0000180006027222
converted back to int 151
Int 1366
float 0.0001628398895264
converted back to int 1366
Int 12301
float 0.0014663934707642
converted back to int 12301
Int 110716
float 0.0131983757019043
converted back to int 110716
Int 996451
float 0.1187862157821655
converted back to int 996451
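The samples above are spot checks. Since a 32-bit float has a 24-bit significand and the scale factor is a power of two, every signed 24-bit sample should survive the trip. The sketch below tests all of them; the function names are illustrative only, and the claim assumes IEEE 754 single precision.

```c
#include <assert.h>

#define SCALE_24BIT (65536 * 128)   /* 2^23 = 8388608 */

/* Convert a signed 24-bit sample to float in [-1, 1) and back. */
int roundtrip_24bit(int sample)
{
    assert(sample >= -SCALE_24BIT && sample < SCALE_24BIT);
    float f = (float)sample / (float)SCALE_24BIT;
    return (int)(f * (float)SCALE_24BIT);
}

/* Count how many of the 2^24 possible samples fail to survive the trip. */
long count_roundtrip_errors(void)
{
    long errors = 0;
    for (int s = -SCALE_24BIT; s < SCALE_24BIT; s++)
        if (roundtrip_24bit(s) != s)
            errors++;
    return errors;
}
```

This covers only the conversion in and out. Rounding inside a long FIR sum is a separate question that this check does not answer.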

nuhi, I just posted the URL for getting VS2005. That's the simplest way to compile CUDA. Why are you complaining so much? This is DIYaudio.com.

nuhi · 2008-08-24 6:30 pm

Koon, you got to be kidding me, you really don't understand what I'm asking. Fine, I won't post in this topic any more.

KOON3876 · 2008-08-24 10:07 pm

???

He mentioned he can't compile with VS2008 and that he knows what he wants to do, so I posted the URL for the free VS2005 Express.

Or does he want ME to build another version the way he likes??

KOON3876 · 2008-08-25 4:59 am

(1) 131072 TAPS test
It looks like the GTX280 can process a 131072-tap FIR in realtime.
================
Using device 0: GeForce GTX 280
[RIFF] (28376916 bytes)
[WAVEfmt ] (16 bytes)
[data] (28376880 bytes)
total samples = 7094220, music length = 160 sec.
Started 970ms
Calculating 14 / 14
done FIR processing.
Ended 113308ms
elapsed 112 sec.
================
required performance
131072(tap)*44100(sample/sec)*2(channel)*4(way)*2(op) = 92GFlops,
achieved performance
92GFlops * 160/112 = 131GFlops
... I'm still not using the full power of the GPU. I need to study streaming or concurrent operation (at the moment the GPU is idle while data is copied to or from it).
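The estimate above can be reproduced in C. This is a sketch of the arithmetic only; the function names are invented for illustration. Without rounding the required figure down to 92 first, the achieved value comes out at about 132 GFlops rather than 131.

```c
#include <assert.h>

/* Floating-point operations per second needed for real-time FIR:
   one multiply and one add per tap, per sample, per output channel. */
double required_flops(double taps, double fs, double channels, double ways)
{
    assert(taps > 0 && fs > 0 && channels > 0 && ways > 0);
    return taps * fs * channels * ways * 2.0;
}

/* Throughput reached when `music_sec` of audio is processed in
   `elapsed_sec` of wall-clock time. */
double achieved_flops(double required, double music_sec, double elapsed_sec)
{
    assert(elapsed_sec > 0);
    return required * music_sec / elapsed_sec;
}
```

required_flops(131072, 44100, 2, 4) gives 92,484,403,200, i.e. about 92.5 GFlops.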

(2) Errors I see
Connecting / disconnecting SPDIF can trigger the errors below; otherwise I have to run for thousands of seconds to see them.

type 1: sudden stop of the stream; patestCallback is no longer called cyclically.
I added WaitForSingleObject(patestEvent, 740ms); to detect this case. I have to re-initialize the buffer and the stream.
type 2: PaStreamCallbackFlags 0x0001 paInputUnderflow
Input is not ready when the callback is called?

I'm still running a long-run test now, and I saw a type 2 error at 3,541 sec and a type 1 error at 5,309 sec.
===============================
Anyway, there is headroom for performance, and there are errors to be resolved.

phofman · 2008-08-25 6:04 am

I am just remotely watching your thread, amazing work. Your input (spdif-in) and output (hdmi) clocks are likely unsynchronized. I am afraid in the long run that will be causing buffer under/overruns too. Though when taking into account the delay introduced by the long filter, a buffer a few thousand samples long could postpone the mismatch for a few minutes. My CD player and sound card clocks differed by one sample a second at 44.1kHz.

KOON3876 · 2008-08-25 3:17 pm

phofman said:
I am afraid in the long run that will be causing buffer under/overruns too.

Thank you, I'm also worried about the clocks being out of sync now.
44102 / 44100 will consume the whole buffer length of 8192 in 4096 seconds. ... it looks like my case.
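The drift estimate works out as buffer length divided by the clock difference. A small sketch, with a hypothetical function name:

```c
#include <assert.h>

/* Seconds until a buffer of `buffer_samples` is used up when the input
   clock runs at `rate_in` Hz and the output clock at `rate_out` Hz. */
double seconds_until_xrun(double buffer_samples, double rate_in,
                          double rate_out)
{
    double drift = rate_in > rate_out ? rate_in - rate_out
                                      : rate_out - rate_in;
    assert(drift > 0.0);   /* identical clocks never drain the buffer */
    return buffer_samples / drift;
}
```

seconds_until_xrun(8192, 44102, 44100) returns 4096, which is in the same range as the 3,541 and 5,309 second failures seen in the long-run test.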

Now I have made a test program with "Dummy In" and "Dummy Out" modes.
The logic is the same, but Dummy In works only with the OUT stream; input is always 0000 from the dummy. Dummy Out works only with the IN stream; output goes to the dummy.
I have started the "Dummy In" version now - it will run for 8 hours.

phofman · 2008-08-25 4:19 pm

I guess the only way out is

* using synchronized clocks (master clock for both the cards, presumably unfeasible for HDMI)

* using synchronous input - e.g. reading from a file

* controlled dropping/making up samples as needed, the way broadcast-stream receivers do - probably unacceptable for hard core audiophiles

fb · 2008-08-25 4:59 pm

Perhaps you could try creating an aggregate ASIO device using Asio4All? (combines 2 soundcards into 1)

phofman · 2008-08-25 5:14 pm

fb said:
Perhaps you could try creating an aggregate ASIO device using Asio4All? (combines 2 soundcards into 1)

I am afraid it would not help with the two independent clocks. E.g. pulseaudio can join outputs of two cards but it tries to align the clock by delays of the two streams, inevitably introducing a minor distortion.

http://pulseaudio.org/wiki/FAQ#CanI...stereosoundcardsintoavirtualsurroundsoundcard

KOON3876 · 2008-08-26 2:59 am

Attached is temporary test code for IN_Dummy and OUT_Dummy.
The GPU logic cannot tell the difference.

IN_Dummy ran for 23,944 seconds without error, and OUT_Dummy has now been running for 16,490 seconds without error.
Now I can believe there is an unsync issue between the in and out streams.

I will separate the in stream and the out stream, then implement a FIFO before the HDMI output.

Which one will be better?
(1) insert / remove one or two "samples" in the FIFO every second, to sync.
(2) insert / remove a "Frame" every few thousand seconds.
(3) Forget streaming, implement synchronous file input

(1) is unnoticeable but yucky.
(2) is noticeable once every few thousand seconds, but there is no modification in between.
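Option (2) could be driven by the FIFO fill level. The sketch below shows only the decision logic and assumes a low and a high watermark; the enum, the function names and the watermark idea are illustrative and do not come from the posted code.

```c
#include <assert.h>
#include <stddef.h>

enum sync_action { SYNC_NONE, SYNC_DROP_FRAME, SYNC_INSERT_FRAME };

/* Leave the stream untouched while the FIFO fill level stays between the
   two watermarks; correct by one whole frame only when it drifts outside. */
enum sync_action choose_action(size_t fill, size_t low_mark, size_t high_mark)
{
    assert(low_mark < high_mark);
    if (fill > high_mark) return SYNC_DROP_FRAME;    /* input clock is faster */
    if (fill < low_mark)  return SYNC_INSERT_FRAME;  /* output clock is faster */
    return SYNC_NONE;
}

/* Apply the action to the fill level; returns the new level in samples. */
size_t apply_action(size_t fill, size_t frame, enum sync_action a)
{
    if (a == SYNC_DROP_FRAME) {
        assert(fill >= frame);
        return fill - frame;
    }
    if (a == SYNC_INSERT_FRAME)
        return fill + frame;
    return fill;
}
```

With an 8192-sample FIFO, 1024-sample frames and a drift of 2 samples per second, a correction of this kind would fire roughly once every 512 seconds per frame of slack between the watermark and the buffer edge.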

Anyway I will implement (3) later for myself. I have many 16GB sticks (each will store 20 CDs), and I'm thinking of buying a 128GB MLC SSD for main wav storage. I'm using a G.SKILL FS-25S2-32GB SLC (OEM, Samsung MCBQE32G5MPP) for OS/programs. Superb.

KOON3876 · 2008-08-26 3:56 pm

http://koonlab.com/CUDA_Sample/Simple CUDAble code.html

I'm now testing how powerful the GPU is, and found the easiest CUDA sample. Calculating 131072 TAPS coefficients took over 20 minutes, so I moved its time-consuming routine to CUDA.

fb · 2008-08-26 4:08 pm

Would there be sync issues streaming from software, e.g. foobar2000 -> jack -> cudafir -> jack -> hardware output?

KOON3876 · 2008-08-26 4:36 pm

Hi fb
Jack is Linux/OSX, so I can't try it, but if jack can control the input sampling rate to match the output sampling rate, there will be no problem.
I'm thinking of writing my own simple player and FIR controller, because it will be easier than making a "virtual WDM driver which syncs to output". I don't need skins, EQ, rating, device transfer, or ripping; I just want jacket display (show the jpeg image in the wav folder).

Anyway I will implement (1) or (2).
I like (2) because,
(a) There is no continuous distortion.
(b) If I know the length of the CD or playlist, I can prepare a buffer big enough to last the whole CD. Then I can restart the stream from the controller.
I don't mind the stream restarting between CDs.

fb · 2008-08-26 4:47 pm

Jack for windows, including ASIO driver:

http://www.grame.fr/~letz/jackdmp.html

You'll need a player that can talk ASIO eg. foobar2000 + foo_asio.dll

Source for the Jack ASIO driver is included, I think it's probably not too hard to write an equivalent directsound driver.

Number (2) will be good for processing external stream input -> HDMI out.

phofman · 2008-08-26 9:16 pm

IMHO the main added and unique value of the project is the CUDA filter. The rest (input, output) are just auxiliary code. If this project was based on linux, it would probably end up as a library with reasonable API. Very soon people would produce a standalone jack-enabled filter (similar to brutefir), a plugin for alsa, an effect for SoX, a LADSPA filter etc. Some of them would be probably written by the filter author as he needs to test a real-world implementation. The author would not care about input sources, input file formats as that is already handled by other applications supporting the various backends, such as SoX.

Syncing of input/output would not be focus of the filter as it is a completely different issue.

KOON3876 · 2008-08-27 2:28 am

Hi phofman
this project was posted 03/30/2008
http://www.diyaudio.com/forums/showthread.php?s=&threadid=120463
and CUDA zone
http://www.nvidia.com/object/cuda_home.html#state=home
(Video and Audio) has link to my page.

I think that's been enough time for it to be ported if a Linux/Mac programmer were interested in CUDA / Audio, but maybe there is nothing yet.
So ... perhaps we cannot hope that "CUDA" is well known to everyone.
