8 channel LPCM over HDMI, It's Multi Amp for Everyone

Status
This old topic is closed. If you want to reopen this topic, contact a moderator using the "Report Post" button.
nuhi
http://www.microsoft.com/express/2005/

fb,
The 9500GT (CUDA 1.1) is float in / float out.
If you use a GTX280, it can be double in / double out, but the GTX280 is too noisy (yes, I have one in another PC) to use in an audio PC.
I have now moved to CUDA 2.0.

TAPS: 8192 is not the limit; it can be 16384 without any change of logic.
__device__ __constant__ float coeff_kernel
has a maximum size of 65536 bytes = 16384 floats.
If you need more taps, coeff_kernel has to be placed in global memory.

I think 65536 taps is plenty: it gives a 0.67 Hz filter pitch and a 1.48 second delay at 44100 Hz sampling.
But "how many FIR taps can CUDA process?" is an interesting challenge :)
I should build another VS2005 project for the GTX280 later.
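The figures above can be reproduced with a small C sketch. The function names are hypothetical and only illustrate the arithmetic: pitch is taken as fs / taps, and the delay quoted in the post is taken as the full filter length, taps / fs.

```c
#include <assert.h>

/* Frequency spacing ("filter pitch") of an N-tap FIR in Hz: fs / N. */
double fir_pitch_hz(int taps, double sample_rate_hz)
{
    assert(taps > 0);
    return sample_rate_hz / taps;
}

/* Time span covered by N taps in seconds: N / fs.
 * This matches the delay figure quoted in the post. */
double fir_span_seconds(int taps, double sample_rate_hz)
{
    assert(sample_rate_hz > 0.0);
    return taps / sample_rate_hz;
}

/* How many float coefficients fit into a memory budget given in bytes,
 * e.g. the 65536-byte constant memory mentioned in the post. */
int max_taps_in_bytes(int budget_bytes)
{
    assert(budget_bytes >= 0);
    return budget_bytes / (int)sizeof(float);
}
```

With 65536 taps at 44100 Hz this gives roughly 0.673 Hz and 1.486 s, and a 65536-byte budget holds 16384 floats on platforms where float is 4 bytes.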

If you were in the US I would have sent you a GeForce 8400!
But postage to AU is $30, the same as the card itself.
 
Please let me use this thread as a memo; I had been implementing delay/taps incorrectly.

(1) TAP length and delay
Current: only the new incoming buffer is used for FIR processing.
[NNNNNNNN] the buffer has to be as long as the TAPS length

New: e.g. a 1024-sample frame comes in and the previous 15360 samples are kept
[PPPPPPPPPPPPPPPN], delay is 1024 samples.
The previous 1024*15 samples + the new 1024 samples are used for the 16384-length FIR. [PPPPPPPPPPPPPPPN] is stored and manipulated in host memory. It will work as a delay line (and can contain the channel delay logic).
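A minimal host-side sketch of the [PPPPPPPPPPPPPPPN] buffer, assuming a plain shift-and-append scheme. The type and function names are hypothetical, and the actual program may manage the buffer differently (for example as a ring buffer).

```c
#include <assert.h>
#include <string.h>

#define FRAME 1024    /* samples per incoming frame (from the post) */
#define TAPS  16384   /* FIR length = 15 previous frames + 1 new frame */

/* Host-side delay line holding the last TAPS samples, oldest first. */
typedef struct {
    float samples[TAPS];
} DelayLine;

/* Start with silence in the history. */
void delay_line_init(DelayLine *d)
{
    assert(d != NULL);
    memset(d->samples, 0, sizeof d->samples);
}

/* Drop the oldest frame, shift the rest down, append the new frame. */
void delay_line_push(DelayLine *d, const float *frame)
{
    assert(d != NULL && frame != NULL);
    memmove(d->samples, d->samples + FRAME,
            (TAPS - FRAME) * sizeof(float));
    memcpy(d->samples + (TAPS - FRAME), frame, FRAME * sizeof(float));
}
```

After each push the whole 16384-sample array is what would be copied to the device. A per-channel delay could be added by offsetting where each channel's window starts, which is one reading of the "channel delay logic" remark.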

(2) FIR Calling Sequence
Current: the kernel is called 8 times per frame, and taps are limited to 16384 by the constant memory size.
for (way 0,1,2,3)
    for (ch L, R)
        copy FIR Coeff to constants
        do FIR<<kernel>>

New: the kernel is called once per frame, with no limit on taps length.
Copy Coeff to Global (once)
While()
    wait Event
    Copy [PPPPPPPPPPPPPPPN] to device
    Execute <<kernel>>(global, global)
    Copy result to Host
Loopend
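As a CPU reference for the "one call per frame" layout, here is a hedged C sketch that filters every way/channel output in a single call. It is not the author's CUDA kernel: the function name, the filter-major coefficient layout, the mapping of filter f to input channel f % channels, and the window length of taps - 1 + frame samples are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One call filters every output (filters = ways * channels).
 * coeff:  filters * taps coefficients, filter-major.
 * window: channels * (taps - 1 + frame) input samples, oldest first.
 * out:    filters * frame output samples.
 * Filter f reads input channel f % channels (assumed layout). */
void fir_all_outputs(const float *coeff, const float *window, float *out,
                     int filters, int channels, int taps, int frame)
{
    int window_len = taps - 1 + frame;
    int f, n, k;

    assert(coeff != NULL && window != NULL && out != NULL);
    assert(filters > 0 && channels > 0 && taps > 0 && frame > 0);

    for (f = 0; f < filters; f++) {
        const float *c = coeff + (size_t)f * taps;
        const float *x = window + (size_t)(f % channels) * window_len;
        for (n = 0; n < frame; n++) {
            float acc = 0.0f;
            /* c[0] multiplies the newest sample for output n */
            for (k = 0; k < taps; k++)
                acc += c[k] * x[taps - 1 + n - k];
            out[(size_t)f * frame + n] = acc;
        }
    }
}
```

On a GPU the two outer loops would typically map to blocks and threads, with the coefficient array living in global memory so that its size is no longer bounded by constant memory.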

(3) Input stream status
The stream can stop on paInputUnderflow or paAbort.
Add an input stream check and pass 00000000 (silence) to the GPU in that case.
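A rough sketch of that input check in C. To keep it self-contained it does not include portaudio.h: INPUT_UNDERFLOW_FLAG is a local stand-in assumed to match PortAudio's paInputUnderflow bit, and stage_input is a hypothetical helper name.

```c
#include <assert.h>
#include <string.h>

/* Assumed stand-in for PortAudio's paInputUnderflow status bit;
 * a real build should use the definition from portaudio.h. */
#define INPUT_UNDERFLOW_FLAG 0x00000001UL

/* Copy one frame of input into the staging buffer that goes to the GPU.
 * If the input is missing or flagged as underflowed, stage silence.
 * Returns 1 when real input was staged, 0 when silence was substituted. */
int stage_input(float *staging, const float *input, size_t count,
                unsigned long status_flags)
{
    assert(staging != NULL);
    if (input == NULL || (status_flags & INPUT_UNDERFLOW_FLAG)) {
        memset(staging, 0, count * sizeof(float));
        return 0;
    }
    memcpy(staging, input, count * sizeof(float));
    return 1;
}
```

Called from the stream callback with its status flags, this keeps the GPU fed with zeros instead of stale data whenever the input side stalls.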
 
KOON3876 said:
fb,
The 9500GT (CUDA 1.1) is float in / float out.
If you use a GTX280, it can be double in / double out, but the GTX280 is too noisy (yes, I have one in another PC) to use in an audio PC.
I have now moved to CUDA 2.0.

If you were in the US I would have sent you a GeForce 8400!
But postage to AU is $30, the same as the card itself.


Thanks :) I'll buy a card, I just want to be somewhat future proof :) I'm not good at understanding the maths side of it... can the 9500GT do 24-bit?
 
fb, about the accuracy of float: does the output below show what you want?

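/* X1 is an int sample; 65536 * 128 = 2^23 scales a 24-bit value into [-1, 1) */
f2 = (float)X1 / (float)( 65536 * 128 );
printf("Int %d\n", X1);
printf("float %2.16f\n", f2);
/* cast to int so the value matches the %d format */
printf("converted back to int %d\n", (int)(f2*(65536 * 128)));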


Int 1
float 0.0000001192092896
converted back to int 1
Int 16
float 0.0000019073486328
converted back to int 16
Int 151
float 0.0000180006027222
converted back to int 151
Int 1366
float 0.0001628398895264
converted back to int 1366
Int 12301
float 0.0014663934707642
converted back to int 12301
Int 110716
float 0.0131983757019043
converted back to int 110716
Int 996451
float 0.1187862157821655
converted back to int 996451
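The spot checks above can be extended to every signed 24-bit value. The sketch below assumes IEEE 754 single precision (24-bit significand) and uses hypothetical helper names. It only covers the int to float to int conversion, not rounding that accumulates inside the FIR sum.

```c
#include <assert.h>
#include <stdint.h>

#define SCALE_24BIT 8388608.0f   /* 65536 * 128 = 2^23 */

/* Scale a signed 24-bit sample into [-1, 1). */
float sample_to_float(int32_t s)
{
    return (float)s / SCALE_24BIT;
}

/* Scale a float in [-1, 1) back to a signed 24-bit sample. */
int32_t float_to_sample(float f)
{
    assert(f >= -1.0f && f < 1.0f);
    return (int32_t)(f * SCALE_24BIT);
}

/* Count the signed 24-bit values that do not survive the round trip. */
long count_roundtrip_errors(void)
{
    long errors = 0;
    int32_t s;
    for (s = -8388608; s <= 8388607; s++) {
        if (float_to_sample(sample_to_float(s)) != s)
            errors++;
    }
    return errors;
}
```

Because every integer of that size fits in the significand and the scale factor is a power of two, the count comes out as zero: 24-bit samples pass through float without loss at the conversion step.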


nuhi, I just posted the URL for getting VS2005. That's the simplest way to compile CUDA. Why are you complaining so much? This is DIYaudio.com.
 
(1) 131072 TAPS test
It looks like the GTX280 can process a 131072-tap FIR in real time.
================
Using device 0: GeForce GTX 280
[RIFF] (28376916 bytes)
[WAVEfmt ] (16 bytes)
[data] (28376880 bytes)
total samples = 7094220, music length = 160 sec.
Started 970ms
Calculating 14 / 14
done FIR processing.
Ended 113308ms
elapsed 112 sec.
================
Required performance:
131072(tap)*44100(sample/sec)*2(channel)*4(way)*2(op) = 92GFlops
Achieved performance:
92GFlops * 160/112 = 131GFlops
... I'm still not using the full power of the GPU. I need to study streaming or concurrent operation (right now the GPU is idle while data is being copied to or from it).
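The throughput estimate can be written out as a small C sketch. The function names are hypothetical, and the 2 operations per tap are assumed to be one multiply and one add.

```c
#include <assert.h>

/* GFlops needed for real-time filtering:
 * taps * fs * channels * ways * 2 ops (multiply + add) per tap. */
double required_gflops(double taps, double fs, int channels, int ways)
{
    assert(taps > 0.0 && fs > 0.0 && channels > 0 && ways > 0);
    return taps * fs * channels * ways * 2.0 / 1e9;
}

/* Sustained rate when music_sec of audio was processed in elapsed_sec. */
double achieved_gflops(double required, double music_sec, double elapsed_sec)
{
    assert(elapsed_sec > 0.0);
    return required * music_sec / elapsed_sec;
}
```

For 131072 taps, 44100 Hz, 2 channels and 4 ways this gives about 92.5 GFlops required and about 132 GFlops achieved for 160 s of music in 112 s. The post's 131 follows from rounding the first figure down to 92 before scaling.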

(2) Errors I see
Connecting / disconnecting SPDIF can trigger the errors below; otherwise I have to run for thousands of seconds to see them.

type 1: sudden stop of the stream; patestCallback is no longer called cyclically.
I added WaitForSingleObject(patestEvent, 740ms); to detect this case. I have to re-initialize the buffer and stream.
type 2: PaStreamCallbackFlags 0x0001 paInputUnderflow
Is the input not ready when the callback is called?

I'm still running a long test now, and I saw a type 2 error at 3,541 sec and a type 1 error at 5,309 sec.
===============================
Anyway, there is headroom for performance, and there are errors to be resolved.
 
I am just remotely watching your thread, amazing work. Your input (SPDIF in) and output (HDMI) clocks are likely unsynchronized. I am afraid that in the long run that will cause buffer under/overruns too. Though, taking into account the delay introduced by the long filter, a buffer a few thousand samples long could postpone the mismatch for a few minutes. My CD player and sound card clocks differed by one sample a second at 44.1 kHz.
 
phofman said:
I am afraid in the long run that will be causing buffer under/overruns too.

Thank you, I'm now worried about the clocks being unsynchronized too.
A 44102 / 44100 mismatch would consume the whole 8192-sample buffer in 4096 seconds ... which looks like my case.
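The drift estimate is just buffer slack divided by the rate difference. A minimal C sketch, with a hypothetical function name:

```c
#include <assert.h>

/* Seconds until slack_samples of buffer are used up when the input
 * clock runs at in_hz and the output clock at out_hz.
 * Returns -1.0 when the two rates match and the buffer never drains. */
double seconds_until_exhausted(double slack_samples,
                               double in_hz, double out_hz)
{
    double drift = in_hz - out_hz;   /* samples gained or lost per second */
    assert(slack_samples >= 0.0);
    if (drift < 0.0)
        drift = -drift;
    if (drift == 0.0)
        return -1.0;
    return slack_samples / drift;
}
```

With 8192 samples of slack, a 2 sample/s mismatch lasts 4096 s, and the 1 sample/s mismatch reported above for a CD player and sound card would last 8192 s.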

Now I have made a test program with "Dummy In" and "Dummy Out" modes.
The logic is the same, but Dummy In runs with only the OUT stream: the input is always 0000 from the dummy. Dummy Out runs with only the IN stream: the output goes to the dummy.
I have started the "Dummy In" version now; it will run for 8 hours :)
 
I guess the only way out is

* using synchronized clocks (master clock for both the cards, presumably unfeasible for HDMI)

* using synchronous input - e.g. reading from a file

* controlled dropping/making up of samples as needed, the way broadcast-stream receivers do - probably unacceptable for hard core audiophiles :)
 
Attached is temporary test code for IN_Dummy and OUT_Dummy.
The GPU logic cannot tell the difference.

IN_Dummy ran for 23,944 seconds without error, and OUT_Dummy has now been running for 16,490 seconds without error.
Now I can believe there is a sync issue between the in and out streams. :(
I will separate the in stream and the out stream, then implement a FIFO before the HDMI output.

Which one would be better?
(1) Insert / remove one or two samples in the FIFO every second, to stay in sync.
(2) Insert / remove a whole frame once every few thousand seconds.
(3) Forget streaming and implement synchronous file input :rolleyes:

(1) is unnoticeable but yucky.
(2) is noticeable once every few thousand seconds, but nothing is modified in between.
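Option (1) could look roughly like the C sketch below: a ring-buffer FIFO plus a "nudge" that drops or repeats a single sample when the fill level leaves a chosen band. All names and the water-mark scheme are assumptions for illustration, not the author's implementation.

```c
#include <assert.h>

#define FIFO_CAP 8192   /* buffer length mentioned in the thread */

typedef struct {
    float data[FIFO_CAP];
    int head;    /* index of the oldest sample */
    int count;   /* samples currently stored */
} Fifo;

void fifo_init(Fifo *q) { q->head = 0; q->count = 0; }

/* Append one sample; returns 0 if the FIFO is full. */
int fifo_push(Fifo *q, float s)
{
    if (q->count == FIFO_CAP) return 0;
    q->data[(q->head + q->count) % FIFO_CAP] = s;
    q->count++;
    return 1;
}

/* Remove the oldest sample; returns 0 if the FIFO is empty. */
int fifo_pop(Fifo *q, float *s)
{
    if (q->count == 0) return 0;
    *s = q->data[q->head];
    q->head = (q->head + 1) % FIFO_CAP;
    q->count--;
    return 1;
}

/* Option (1): move the fill level by one sample toward the band
 * [low_mark, high_mark]. Returns -1 if the oldest sample was dropped,
 * +1 if the newest sample was repeated, 0 if nothing was changed. */
int fifo_nudge(Fifo *q, int low_mark, int high_mark)
{
    float s;
    assert(0 < low_mark && low_mark < high_mark && high_mark < FIFO_CAP);
    if (q->count > high_mark) {
        fifo_pop(q, &s);
        return -1;
    }
    if (q->count > 0 && q->count < low_mark) {
        s = q->data[(q->head + q->count - 1) % FIFO_CAP];
        fifo_push(q, s);
        return 1;
    }
    return 0;
}
```

Calling fifo_nudge about once per second would correspond to option (1). Option (2) would apply the same idea to a whole frame at a much longer interval.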

Anyway, I will implement (3) later for myself. I have many 16GB sticks (each will store 20 CDs), and I'm thinking of buying a 128GB MLC SSD for main WAV storage. I'm using a G.SKILL FS-25S2-32GB SLC (OEM, Samsung MCBQE32G5MPP) for OS/programs. Superb.
 

Attachments

  • wavexgen06_inouttest.cu.txt
    24.4 KB
 
Hi fb
Jack is Linux/OSX, so I can't try it, but if jack can control the input sampling rate to match the output sampling rate, there will be no problem.
I'm thinking of writing my own simple player and FIR controller, because that will be easier than making a "virtual WDM driver which syncs to the output". I don't need skins, EQ, rating, device transfer, or ripping; I just want jacket display (show the JPEG image in the WAV folder) :)
Anyway, I will implement (1) or (2).
I like (2) because:
(a) There is no continuous distortion.
(b) If I know the length of the CD or playlist, I can prepare enough buffer to last the whole CD. Then I can restart the stream from the controller.
I don't mind the stream restarting between CDs.
 
IMHO the main added and unique value of the project is the CUDA filter. The rest (input, output) is just auxiliary code. If this project were based on Linux, it would probably end up as a library with a reasonable API. Very soon people would produce a standalone jack-enabled filter (similar to brutefir), a plugin for alsa, an effect for SoX, a LADSPA filter etc. Some of them would probably be written by the filter author, as he needs to test a real-world implementation. The author would not care about input sources or input file formats, as those are already handled by other applications supporting the various backends, such as SoX.

Syncing of input/output would not be focus of the filter as it is a completely different issue.
 