Biquad calculations for custom filter design

It's not my expertise, but my understanding was that for a fixed number of bits, floating point offered better resolution, which is contrary to the way I read your post.
Not necessarily intuitive, but it's the opposite. Floats bring vastly increased dynamic range, expressing that as the ratio between the largest and smallest number they can represent, but lower resolution for any given value. In a typical 32-bit processor a float allocates 24 bits for the mantissa and 8 bits for the exponent. In a fixed point implementation all 32 bits are allocated to the equivalent of the float's mantissa (fixed point can be considered in a sense as a float with a fixed exponent), so the fixed point implementation gains 8 bits of resolution. In the problematic area of subtracting numbers of the same scale (and hence having the same exponent) but differing by very small amounts, the fixed point solution takes full advantage of the extra 8 bits.
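A quick way to see the 8-bit difference in practice, as a minimal Fortran sketch (assuming default REAL is IEEE single precision and taking Q1.31 as a stand-in 32-bit fixed-point format):

program resolution_demo
  implicit none
  real :: one = 1.0

  ! Gap between adjacent representable 32-bit floats near 1.0
  ! (24-bit mantissa): about 1.2e-7.
  print *, 'float32 step near 1.0  :', spacing(one)

  ! A 32-bit fixed-point word with 31 fractional bits (Q1.31)
  ! resolves 2**(-31), about 4.7e-10: 256 times (8 bits) finer.
  print *, 'Q1.31 fixed-point step :', 1.0d0 / 2.0d0**31
end program resolution_demo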
 
Floats bring vastly increased dynamic range, expressing that as the ratio between the largest and smallest number they can represent, but lower resolution for any given value. In a typical 32-bit processor a float allocates 24 bits for the mantissa and 8 bits for the exponent.

The standard 24/8 mantissa/exponent split in single precision is fine for general math but way suboptimal for audio - who could possibly need a dynamic range of 760 dB? (Edit: that only counted the exponent; including the mantissa too it's more like 900 dB.)
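Roughly where figures like those come from, assuming an IEEE single-precision layout:

20 * log10(2**127) ≈ 765 dB      (exponent range above 1.0)
24 bits * 6.02 dB/bit ≈ 145 dB   (mantissa)
total ≈ 910 dB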
 
JohnPM's point is absolutely correct: it's small differences between large numbers that cause problems at low frequencies, and dynamic range has nothing to do with that. There are many, many numerical problems in physics which are pointless to attempt in single precision. I have even done problems that would not converge in double precision. And - another plug for FORTRAN! - that's why FORTRAN has quad precision - for crazy physicists like me!
 
JohnPM's point is absolutely correct: it's small differences between large numbers that cause problems at low frequencies, and dynamic range has nothing to do with that.

JohnPM's points are totally correct, but dynamic range is to some extent a trade-off against resolution. Bits taken from the mantissa reduce resolution but increase dynamic range. Audio does not need the dynamic range offered by 24/8, so resolution suffers. In using doubles, in order to get the needed resolution, the dynamic range becomes even greater - that's an even bigger waste of bits. As in all engineering, there are trade-offs, so it's misleading to say 'dynamic range has nothing to do with resolution'.
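For comparison, an IEEE double splits its bits roughly 53/11: the 53-bit mantissa gives about 53 x 6.02 ≈ 320 dB of resolution, while the 11-bit exponent spans well over 6000 dB of range, far more than any audio path can use.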
 
And - another plug for FORTRAN! - that's why FORTRAN has quad precision - for crazy physicists like me!

Any recommendations on free compilers, or do you only use commercial ones?
I used to be called "FORTRAN Man" back in the day ;)
My contemporaries in school complained about the "dead" language, but FORTRAN is much better for teaching numerics than anything else.

I was watching a recent NOVA show where they were doing weather sims and they showed code on the screen - FORTRAN....NOT a dead language.
 

Not dead at all! Just look into Intel FORTRAN - not free, but amazingly sophisticated. With Intel FORTRAN and Visual Studio I can seamlessly program in whatever language is best suited to the task - all in a single project. In number crunching there is no contest: FORTRAN is still supreme. Multiply two complex matrices as A = B*C and the compiler parallelizes it!
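For what it's worth, in standard Fortran the matrix-product intrinsic is MATMUL (for whole arrays, A = B*C is an element-by-element product). A minimal complex-matrix sketch follows; whether it actually runs multi-threaded depends on the compiler switches and the math library behind the intrinsic:

program matmul_demo
  implicit none
  integer, parameter :: n = 512
  real    :: re(n,n), im(n,n)
  complex :: a(n,n), b(n,n), c(n,n)

  ! Build two complex matrices from uniform random real and imaginary parts.
  call random_number(re)
  call random_number(im)
  b = cmplx(re, im)
  call random_number(re)
  call random_number(im)
  c = cmplx(re, im)

  ! The compiler/runtime is free to parallelize this intrinsic call.
  a = matmul(b, c)
  print *, 'a(1,1) =', a(1,1)
end program matmul_demo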
 
If you like FORTRAN then consider loading up the CUDA library from NVIDIA. If you have one of their newish graphics cards you can do multithreading directly against its hundreds of processing units. Works pretty well actually, and gives a heck of a lot more performance vs. running these as OS tasks, with the memory-model, cache, and memory-coherency bottlenecks one can run into there.

Put it this way: it's hundreds of cores in these processor units vs. the four to sixteen you get from a general-purpose Intel CPU.

Fun stuff - I used to write compiler backends fifteen years ago!
 
I believe that the software is aware of how many processors there are and uses what is available.

If the compiler is aware of how many cores are available on the compilation system, fair enough. But it can't know about the system the compiled code is to run on unless you tell it. When the target system is the same as the compiling system, no problem. But isn't your code able to run on other systems? What happens if you send a binary to someone else?

That's the way with all multi-threaded apps, right?

What's the app in question here - the compiler itself or the binary you're creating with your FORTRAN compiler? You had said that the compiler parallelized the matrix multiplication - did you mean it created multiple threads, one for each raw multiply? I'm not knowledgeable about how threading works on Intel machines, so I'm hoping to learn something.

You can't assume multiple processors; the software has to be aware of how many are available, and if there is only one then it uses only one, etc.

Sure, was just curious about whether this could be done after compilation or if the compiler needed to know at compile time how many cores there were.
 
The number of cores on the target machine is available from the system resources, just as any other system feature. All the software has to do is query the system and it knows how many cores there are. It doesn't matter how many cores are on the compiling machine, it's where the executable runs that matters.

In a matrix multiplication, for example, each row/column multiply (there are N^2 of them for an N x N matrix) runs on a separate thread. This means that four cores could do this job in 1/4 the time.
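To make that concrete, here is the same idea written out by hand with OpenMP directives (a sketch, not necessarily what Intel Fortran's auto-parallelizer generates internally; compile with -qopenmp or -fopenmp). Each thread takes whole columns rather than a single dot product:

subroutine par_matmul(a, b, c, n)
  implicit none
  integer, intent(in)  :: n
  real,    intent(in)  :: b(n,n), c(n,n)
  real,    intent(out) :: a(n,n)
  integer :: i, j, k
  real :: s

  ! Each column of A is independent of the others, so the runtime can
  ! hand columns out to however many cores the machine actually has.
  !$omp parallel do private(i, k, s)
  do j = 1, n
     do i = 1, n
        s = 0.0
        do k = 1, n
           s = s + b(i,k) * c(k,j)
        end do
        a(i,j) = s
     end do
  end do
  !$omp end parallel do
end subroutine par_matmul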

That's what is so cool about the GPU thing above. Those can have dozens of separate cores. Matrix multiplications are what they do. As long as the graphics doesn't need them, why not let some FORTRAN code use them!
 
The number of cores on the target machine is available from the system resources, just as any other system feature.

Cannot follow you at all here. System features are features of the machine I'm using now, not of any target machine the binary is intended to run on. So whilst I can see that the compiler can easily find out the number of cores on the machine it runs on, I'm still jolly unclear how it would find out about the target machine without an explicit compiler switch.

All the software has to do is query the system and it knows how many cores there are.

Yes, but only for the machine it's running on, not for any possible target machine. It's not clairvoyant!

It doesn't matter how many cores are on the compiling machine, it's where the executable runs that matters.

Indeed. So if and when you send your binary to a friend to run, at compile time there's no way of knowing how many cores he has.

In a matrix multiplication, for example, each row/column multiply (there are N^2 of them for an N x N matrix) runs on a separate thread.

OK, so that answers one of my earlier questions. It's multi-threaded, so then the compiler must create a separate thread for each of the individual multiplies. I'm not aware of how the next step works - how each multiply gets allocated to a different core running a different thread. There's probably a fair amount of overhead in spawning a new thread on a separate core (is this handled through an OS call?), so I think this would take a lot longer than a single FPU-assisted multiply, which these days is jolly fast. So with my (naive) understanding of multi-threading I can't see how this can be efficient if only single multiplies are allocated to separate threads. As far as I can see, it would be far better to allocate whole columns to threads so the thread-spawning overhead is reduced.

This means that four cores could do this job in 1/4 the time.

Assuming there's no thread spawning overhead, which I take to be an assumption too far.

That's what is so cool about the GPU thing above. Those can have dozens of separate cores. Matrix multiplications are what they do. As long as the graphics doesn't need them, why not let some FORTRAN code use them!

Well indeed, but in this case the number of cores (presumably dozens or hundreds even) is known before the compilation takes place (known target architecture), so the optimisation can be done really well. And this is multicore, not multi-thread.
 
When the compiled program runs on the target machine it queries the OS, which tells it how many processors there are. Each row/column multiply is done as a thread. If there is one processor then the threads run sequentially. If there are four processors then the threads are run in parallel, four at a time. The thread-spawning overhead is handled by the compiler; there is no overhead on the target machine once compiled.
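A sketch of that run-time query, using the OpenMP runtime library as a stand-in for whatever mechanism Intel Fortran actually uses (the routine names below are OpenMP's, not anything quoted above):

program core_query
  use omp_lib        ! OpenMP runtime; compile with -qopenmp or -fopenmp
  implicit none

  ! Both calls ask the machine the executable is running on right now,
  ! so the answer is correct no matter where the binary was compiled.
  print *, 'processors visible to the runtime:', omp_get_num_procs()
  print *, 'threads it would use by default  :', omp_get_max_threads()
end program core_query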
 
f77

Generally, creating a thread costs about 10% (in performance terms) of creating a process; the OS work involved in creating a process (process table entries, establishing the memory space, etc.) is substantial.
Threads simplify intra-process communication since they run in the same address space, and synchronization primitives are generally simpler to implement as CPU instructions without having to go through kernel calls with context switching and so on. Threads are a good fit for computational tasks, especially if you execute in large virtual memory address spaces and 64-bit.

The question came up of how one can thread effectively given compile time vs. run time, since the compiling system may not know how many cores will be available at run time. This is a good point and is often addressed by the people writing the compiler.

First off, in FORTRAN it is DO loops that are mostly parallelizable, and those loops generally can't be nested or perform data reductions. However, gedless wants to do additions on large matrices of data, which is highly parallelizable: if we have n entries in array x and array y which will be added (or otherwise operated on) and stored in array z, then we can easily parallelize this.

Here is some pseudocode to illustrate my point:
do n = 1, 100000
   z(n) = x(n) + y(n)
end do

We will go through two 100,000-entry arrays and add their values into the same positions of a third array (array z). To parallelize this we can split the work into two threads and have thread one run the first 50,000 entries and thread two run the second 50,000 entries. Once both threads complete, the do loop has been parallelized.

The compilers I worked on (these are mainstream compilers) will select a chunk size (the minimum number of iterations run in each thread, except for the last thread, which runs something equal to or less than the chunk size). At process start (initialization) a piece of code typically runs which creates a few worker threads; they are just created and sit idle at init time. The number of threads usually depends on the number of cores present in the system at program init, or, if an explicit directive was given at compile time, it follows that directive. Once the program reaches the entry point of the do loop, the compiler-generated code calls the threads and hands each one a separately parallelized piece of the loop, unrolled as a function of 'chunk size' iterations, to execute. The main program idles waiting for the issued worker threads to complete. Since everything runs in the same memory space and there are no concurrency issues, we can safely wait, and once they have completed, main program execution continues. Voila - we've got threading!
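Written out with an explicit OpenMP directive as a stand-in for the code such a compiler generates (compile with -qopenmp or -fopenmp), the loop above with a 50,000-iteration chunk looks like this:

program chunked_add
  implicit none
  integer, parameter :: npts = 100000
  real :: x(npts), y(npts), z(npts)
  integer :: n

  call random_number(x)
  call random_number(y)

  ! schedule(static, 50000) splits the iterations into 50,000-element
  ! chunks, so with two worker threads each one gets half the arrays.
  !$omp parallel do schedule(static, 50000)
  do n = 1, npts
     z(n) = x(n) + y(n)
  end do
  !$omp end parallel do

  print *, 'z(1) =', z(1)
end program chunked_add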

In very large do loops the worker threads may be called multiple times to complete the work, especially if the length of the do loop isn't known at compile time (such as do n = 1, m, where m can't be resolved at compile time).

The chunking and scheduling machinery of a multithreaded program is important, as it lets us keep both processors hot with a stream of instructions and keep a sequence of data in cache so that we don't have to stall the CPU while fetching memory. Mainstream CPUs pay attention to cache coherency and to instruction flow.

I'm not sure all compilers work this way, but I have to assume they do; this is pretty standard fare for creating portable and efficient automatic threading. The ability to split the work into not-too-granular chunks, combined with the ability to schedule that work onto the available processor resources, makes it easy for the programmer to get threaded code without knowing threading.

It's often advantageous to run more threads than there are cores in the system if any of the threads will do things like I/O. Other than that, matching the threads to the available run-time resources is what a friendly compiler will do.

Some people use explicit thread directives and explicitly set how many cores a loop should be unrolled onto. I generally don't do that; if I'm down at that layer I might as well write directly against the thread library and manage it myself, without the compiler figuring it out for me.

I doubt cross-compilation across architectures works well for automatic multithreading, but I could be wrong. I assume that is what was being asked about when cross-compilation was mentioned.

The NVIDIA GPU architecture is pretty interesting in that it exposes low-level primitives for scheduling work directly on the GPU, and even their mid-range graphics cards support CUDA, with 120+ stream processors you can schedule tasks on directly.

They even have a partner who wrote an auto-parallelizing FORTRAN-to-C translator, which sounds pretty cool. Otherwise you can create your own thread directives and run them directly against the platform. With the hundreds of megs of fast DDR memory on graphics cards, combined with bulk PCI-X transfers, it should be pretty darn fast for these classes of operations. These cores interest me in that they are optimized for data flow and data throughput - pretty ideal if you do large data-transformation tasks.
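For the curious, here is roughly what the CUDA route looks like from Fortran. This is CUDA Fortran, a vendor extension (the PGI compiler, nowadays NVIDIA's nvfortran with -cuda), not standard Fortran, and the kernel and launch details are illustrative rather than taken from anything above:

module vadd_kernel
  use cudafor
contains
  attributes(global) subroutine vadd(x, y, z, n)
    integer, value :: n
    real, device :: x(n), y(n), z(n)
    integer :: i
    ! One lightweight GPU thread per array element.
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) z(i) = x(i) + y(i)
  end subroutine vadd
end module vadd_kernel

program gpu_add
  use cudafor
  use vadd_kernel
  implicit none
  integer, parameter :: npts = 100000
  real :: x(npts), y(npts), z(npts)
  real, device :: dx(npts), dy(npts), dz(npts)   ! arrays in GPU memory
  integer :: istat

  call random_number(x)
  call random_number(y)
  dx = x                     ! host-to-device copies by plain assignment
  dy = y
  call vadd<<<(npts + 255) / 256, 256>>>(dx, dy, dz, npts)
  istat = cudaDeviceSynchronize()
  z = dz                     ! copy the result back to host memory
  print *, 'z(1) =', z(1)
end program gpu_add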

It's been a while since I did FORTRAN and compilers, but I try to keep the old skills up to date. Have fun, and I hope you don't mind me sharing my perspective and experience!
 
One of my biggest headaches was languages that do garbage collection. FORTRAN isn't one of them, but Java etc. are. They typically walk through their object space in memory looking for orphaned objects. The problem this causes is massive pollution of the memory cache; it turned out that for memory-intense tasks the caches got so 'dirty' that we couldn't get throughput up enough to be meaningful. Of course this didn't matter for I/O-bound applications (networking, storage, etc.), but it matters greatly for things like really large arrays of objects and running math on them. FORTRAN, with its strong typing and relatively straightforward addressing, was easier for a compiler to optimize in that sense. But my true love is C. There is nothing that can be done in FORTRAN that I can't do just as efficiently in C (or so I like to believe)... :)
 