-
Notifications
You must be signed in to change notification settings - Fork 0
/
brutefir.html
2456 lines (2272 loc) · 110 KB
/
brutefir.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="initial-scale=1, width=device-width">
<title>BruteFIR</title>
<style media="screen, print">
table, th, td {
border: 1px solid black;
}
</style>
</head>
<body>
<h1>BruteFIR</h1>
<h2>Table of contents</h2>
<ul>
<li><a href="brutefir.html#news">News</a>
<li><a href="brutefir.html#docage">Note on the documentation's age</a>
<li><a href="brutefir.html#whatis">What is it?</a>
<li><a href="brutefir.html#good">What is it good for?</a>
<li><a href="brutefir.html#bruteconv">BruteFIR convolution</a>
<ul>
<li><a href="brutefir.html#bruteconv_1">The problem of complexity</a>
<li><a href="brutefir.html#bruteconv_2">Problems with long FFTs</a>
<li><a href="brutefir.html#bruteconv_3">Partitioned convolution</a>
<li><a href="brutefir.html#bruteconv_4">Optimizing where it counts</a>
<li><a href="brutefir.html#bruteconv_5">Conclusion</a>
</ul>
<li><a href="brutefir.html#download">Where can I get it?</a>
<li><a href="brutefir.html#howfast">How fast is it?</a>
<ul>
<li><a href="brutefir.html#throughput">How high throughput can I get?</a>
<li><a href="brutefir.html#lowdelay">How low I/O delay can I get?</a>
</ul>
<li><a href="brutefir.html#hardware">Hardware considerations</a>
<li><a href="brutefir.html#config">Configuring and running</a>
<ul>
<li><a href="brutefir.html#config_1">General settings</a>
<li><a href="brutefir.html#config_2">General structure syntax</a>
<li><a href="brutefir.html#config_3">Coeff structure</a>
<li><a href="brutefir.html#config_4">Input and output structure</a>
<li><a href="brutefir.html#config_5">Filter structure</a>
<li><a href="brutefir.html#config_6">Configuration file example</a>
</ul>
<li><a href="brutefir.html#bfio">I/O modules</a>
<ul>
<li><a href="brutefir.html#bfio_alsa">ALSA sound card I/O (alsa)</a>
<li><a href="brutefir.html#bfio_oss">OSS sound card I/O (oss)</a>
<li><a href="brutefir.html#bfio_jack">JACK audio server I/O (jack)</a>
<li><a href="brutefir.html#bfio_file">Raw PCM file I/O (file)</a>
<li><a href="brutefir.html#bfio_own">Writing your own I/O module</a>
</ul>
<li><a href="brutefir.html#bflogic">Logic modules</a>
<ul>
<li><a href="brutefir.html#bflogic_cli">Command line interface (cli)</a>
<li><a href="brutefir.html#bflogic_eq">Run-time equalizer</a>
<li><a href="brutefir.html#bflogic_own">Writing your own logic module</a>
</ul>
<li><a href="brutefir.html#tuning">Tuning</a>
<ul>
<li><a href="brutefir.html#tuning_1">Realtime index</a>
<li><a href="brutefir.html#tuning_2">FFTW wisdom</a>
<li><a href="brutefir.html#tuning_3">Low latency patch</a>
<li><a href="brutefir.html#tuning_4">Sample clock problems</a>
<li><a href="brutefir.html#tuning_5">Double precision or not</a>
<li><a href="brutefir.html#tuning_6">Choosing number of partitions</a>
<li><a href="brutefir.html#tuning_7">Realtime issues</a>
</ul>
<li><a href="brutefir.html#features">Request features</a>
<li><a href="brutefir.html#references">References</a>
</ul>
<p>
<h2 id="news">News</h2>
<p><strong>2016-11-15</strong><br>
Re-released 1.0o with corrected version number in the output (it
incorrectly said 1.0m before).
<p><strong>2016-08-09</strong><br>
Maintenance release 1.0o. Second this day, I was a bit trigger happy on
the first. Here I've put in some minor bugfixes received from the Debian
package maintainer.
<p><strong>2016-08-09</strong><br>
Maintenance release 1.0n, no functional change.
<p><strong>2013-11-29</strong><br>
There was still a typo in the last uploaded 1.0m affecting
SSE2. Uploaded fix.
<p><strong>2013-11-28</strong><br>
There was a typo in the last uploaded 1.0m release causing the
S24_LE/S24_3LE formats to break for input. So if you downloaded 1.0m
yesterday please do it again.
<p><strong>2013-11-27</strong><br>
BruteFIR v1.0m. Fixed an SSE2 bug introduced in 1.0l. Added
'safety_limit' feature which can be used to protect your expensive
speakers (and sensitive ears). Also fixed a rare race condition bug
and further synchronized sample formats with ALSA, so now S24_4LE
means low 24 bits of 32 bit word. Thus if you used S24_4LE before you
should use S32_LE now to get the old behavior.
<p><strong>2013-10-06</strong><br>
BruteFIR v1.0l. Refreshed code to compile well on x86-64, dropped
3Dnow support and replaced the hand-coded SSE with SSE C code, and
refreshed JACK and ALSA I/O modules to catch up with changes in the
APIs. Also fixed a filter indexing bug in the <code>cffa</code> CLI
command.
<p><strong>2009-03-31</strong><br>
BruteFIR v1.0k. Refreshed JACK and ALSA I/O modules to catch up with
changes in the APIs.
<p><strong>2009-03-05</strong><br>
BruteFIR v1.0j. Fixed a memory leak in the CLI.
<p>
As you may have noted, I do not any longer actively develop BruteFIR
further. From my point of view the software is "complete". I have had
plans to start a next generation BruteFIR from scratch with a more
modern design (using threads instead of forked processes etc), but
priorities in life change, and I do not any longer have much time to
write code so it is not likely to happen.
<p>
However, I'm happy to see that there are many BruteFIR users out
there. Do continue to report bugs as my intention is to keep the
code working and fix any bugs that arise.
<p><strong>2006-10-08</strong><br>
BruteFIR v1.0i. Minor fixes in CLI. Sub-sample delay now works also
with negative delays.
<p>
There's also some interesting patent news, actually this is old
news, but I did not know about it until now - the patent EP0649578 has
been revoked after an opposition. It was argued that from the existing
prior art (of which most is referenced here), the non-uniform
partitioned part of the patent lacks inventive step. Additionally,
there is a testimony that claims that the "invention" was actually
exposed in advance - an academic person at a Danish university
explained the idea to the people that then went home and filed the
patent which has plagued the industry and open-source world for so
long. A theft and lockdown of ideas which we so often before have seen
in the world of patents. Anyway, based on this opposition, the EPO has
revoked the patent.
<p>
For BruteFIR this does not mean anything since it employs uniform
partitioned convolution. However, the non-uniform partitioned
convolution algorithm is probably free to use in open-source
software. There are still patents on this in other countries (such as
the US), but with the corresponding patent revoked in Europe, they
will be hard to defend. Note that I'm not a patent lawyer, so if you
really are going to implement non-uniform partitioned convolution I
recommend to consult a professional first, because there is no 100%
clear prior art as in the uniform partitioned convolution case.
<p>
In the future, there might be a non-uniform version of BruteFIR, but
not likely in the near future since it will require new design from
the ground and up. The convolution principle is the same, but
implementation is much different with non-uniform partitions, since
you have to perform different size FFTs in parallel. The simplest idea
would be to simply run the same convolution engine in several "layers"
with different partition sizes and mixing together the result in the
end. This way the implementation difference is small and could be
realized quite easily in BruteFIR, but efficiency will suffer. I
probably will rather spend time to do it from the ground and up. But
not this year.
<p><strong>2006-07-12</strong><br>
BruteFIR v1.0h. Added a sub-sample delay function (that is delays
smaller than one sample can be specified), support for text format in
the file I/O module, and support for naming ports in the JACK I/O
module.
<p><strong>2006-03-30</strong><br>
BruteFIR v1.0g. Fixed input mixer and delay setting bugs.
<p><strong>2005-08-11</strong><br>
BruteFIR v1.0f. Fixed a filter parse bug.
<p><strong>2005-06-28</strong><br>
BruteFIR v1.0e. Fix to work with GCC 4.0.
<p><strong>2005-06-12</strong><br>
Released a minor maintenance release, 1.0d. Contains some minor
adjustments to the JACK I/O module, and a fatal bug fix concerning
multiple inputs/outputs, which was introduced in 1.0b.
<p><strong>2005-01-04</strong><br>
BruteFIR v1.0c. Mistake in 1.0b caused the CLI module not accepting
return characters, causing problems for telnet operation. Fixed that.
<p><strong>2004-11-21</strong><br>
BruteFIR v1.0b. Updated the JACK I/O module, it is now possible to run
several BruteFIR instances using JACK at the same time, and it is not
necessary to connect to external ports at startup. The CLI can now
take commands from a serial line. Additionally, a couple of remaining
bugs in the equalizer module have been fixed.
<p><strong>2004-08-07</strong><br>
BruteFIR v1.0a. Minor update, removed the coefficient set limit and
updated the example configuration files in the package.
<p><strong>2004-04-21</strong><br>
BruteFIR v1.0. I felt it was time to release 1.0 now. I have fixed up
the code so it compiles on FreeBSD and Solaris again, and I added an
OSS module, which makes BruteFIR truly usable on FreeBSD platforms.
<p><strong>2004-02-22</strong><br>
BruteFIR v0.99n. As suspected, the merge function was not good enough,
and has now been removed. However, instead, a cross-fade algorithm has
been added, which indeed is a bit costly in terms of CPU time, but
makes coefficient changes truly seamless.
<p><strong>2004-01-17</strong><br>
BruteFIR v0.99m. Fixed a few bugs, and updated the ALSA code to
support the 1.0 version of the API. Now it is also possible to make more
time-precise CLI scripting. I'm suspecting that the merge function is
not good enough to be very useful, I may remove it or replace it with
a better sounding (but inefficient) cross-fade algorithm.
<p><strong>2003-10-26</strong><br>
BruteFIR v0.99l. Added a function to hide discontinuities that may
occur when filter coefficients are changed in runtime (function is
called "merge"). Also added a skip option to coefficient loaded, to
skip a given number of bytes in the beginning of a file.
<p><strong>2003-08-10</strong><br>
BruteFIR v0.99k. This is a maintenance release, which fixes a few
bugs, including a severe powersave bug which could cause unexpected
and very loud noise come out. It also adds an option to run in daemon mode.
<p><strong>2003-07-11</strong><br>
BruteFIR v0.99j. Now we are getting near a 1.0 release. This release
contains quite many new features, and bug fixes. Some feature
highlights: BruteFIR now employs FFTW3, there is support for 32 and 64
bits in the same binary and buffer over/underflows can be
ignored. Among important bug fixes are that FFTW wisdom is now stored
properly, so it can be re-used more often, and the equalizer module
now sets the magnitude properly at the edges.
<p><strong>2003-02-11</strong><br>
BruteFIR v0.99i. I released the h-version a bit too early, lots of
small but significant mistakes followed. This version fixes those
(hopefully).
<p><strong>2003-02-09</strong><br>
BruteFIR v0.99h. A couple of bug fixes associated to the new callback
I/O. It also adds support for native endian and auto sample formats,
and a simple automatic load balancer for multi-processor machines.
<p><strong>2003-02-02</strong><br>
BruteFIR v0.99g. This release adds support for callback I/O. One
callback I/O module is available, supporting JACK. This support means
that the program has went through quite radical reorganizations, so
something might be broke. If you discover any problems, please let me know.
<p><strong>2003-01-05</strong><br>
BruteFIR v0.99f. Minor peak meter adjustment and bug fix.
<p><strong>2002-12-25</strong><br>
BruteFIR v0.99e. Lots of tuning have been made to work better with
sound card I/O. It should now be more reliable in low latency
configurations. The release also includes some various minor
improvements and bug fixes.
<p>
For those that find the default configuration file unnecessary and
just in the way, there is now the <code>-nodefault</code> command line
option, which will cause BruteFIR to skip the default configuration
file.
<p><strong>2002-11-28</strong><br>
BruteFIR v0.99d. Fixes yet another bug in the ALSA code, which caused
the software not to work with hardware with odd period sizes, such as
some (all?) ice1712-based cards. The real-time index has also been
much simplified and improved in terms of reliability, and a power-save
feature was added.
<p>
Sometime soon, there will be 1.0...
<p><strong>2002-10-10</strong><br>
BruteFIR v0.99c, is an important bug fix release. Among other fixes,
it fixes the slightly embarrassing bug of incorrect reading of
3 byte 24 bit formats. Apart from many bug fixes, it adds double
buffer support to the equalizer module, and a simple script function
to the CLI. The risk of buffer underflow at startup has also been
strongly reduced.
<p><strong>2002-09-12</strong><br>
BruteFIR v0.99b, fixed a serious bug in the ALSA code, which caused
buffer underflow when the software buffer size was larger than the
hardware buffer size.
<p><strong>2002-08-25</strong><br>
BruteFIR v0.99a, a couple of minor bug fixes, discovered during the
development of <a
href="http://www.ludd.ltu.se/~torger/almusvcu.html">AlmusVCU</a>.
<p><strong>2002-08-04</strong><br>
This new release (v0.99) contains a first version of an equalizer
module, which allows equalization to be changed in runtime. Now the
I/O delay is fixed, always exactly twice the filter block length
(if the sound card hardware is properly designed). Good for
synchronization with other audio processors, or clustering. There is
also a slight change in configuration file format, so you know why it
will complain when run with an old configuration file.
<p><strong>2002-07-26</strong><br>
Added a minor feature that proved necessary for some applications,
such as Ambisonics. This feature makes it is possible to multiply
inputs/outputs in mixing with negative values, not just positive. The
new version is BruteFIR 0.98e. An invalid version was available a few
hours during this day (forgot to include some CLI patches), so if you
downloaded your v0.98e at this date, download it again.
<p><strong>2002-07-21</strong><br>
Two bug fixes in this new release, BruteFIR 0.98d. The first concerns
scaling of coefficient parameters, where PCM coefficients where
incorrectly scaled. The other fix is in the ALSA I/O module, which
could at some occasions fail to set the sample rate.
<p><strong>2002-06-14</strong><br>
BruteFIR 0.98c, another small step towards 1.0. This contains an
important bugfix. Earlier versions could mix up the mix buffers which
caused looping sound with some filter configurations, this is now
fixed. The common mistake (at least for me) to link a 32 bit BruteFIR
with a 64 bit FFTW or the other way around is now taken care of.
<p><strong>2002-05-05</strong><br>
BruteFIR 0.98b. The sample rate monitoring added in 0.98a is now
optional, through the option monitor_rate. Also support for SSE2 for
Pentium 4 processors is implemented (only used when compiled with
double precision). It is also possible to compile and run on Solaris
with Sparc processors.
<p><strong>2002-04-16</strong><br>
Yet another of the usual minor updates: BruteFIR 0.98a. This fixes a
minor bug which could cause stray processes to be left after exit. It
also improves the real-time index calculation so it works properly on
SMP, and the program now exits with an error when sample rate is
changed in runtime. There are now interpretable exit codes from the
program as well, so one can now why it exited.
<p><strong>2002-03-25</strong><br>
BruteFIR 0.98: This new release supports virtual inputs and outputs,
which can be used to control delay of individual outputs even if they
are mixed to the same physical output.
<p><strong>2001-12-20</strong><br>
Another bugfix release, 0.97d. Also added a <code>-quiet</code> command
line parameter to suppress title, warnings and informational messages
at startup.
<p><strong>2001-12-17</strong><br>
Due to popular demand, the ALSA I/O module has got support for
accessing the software modes of the ALSA library. The new release is
0.97c.
<p><strong>2001-12-16</strong><br>
Ooops. The new sample format handling was not as good as I initially
thought. Now that has been fixed. Oh, clipping for 32 bit formats
works again. I hope I did not burst anyone's ears (other than
mine). The release version is 0.97b.
<p><strong>2001-12-15</strong><br>
Some major bugs was introduced in 0.97, hopefully most of them has
been squashed in this new release, 0.97a.
<p><strong>2001-12-09</strong><br>
BruteFIR 0.97: a new release with lots of major changes. The software is
now much more modular. It uses modules for input and output, ALSA and
file I/O being the first modules available. It also supports
logic modules, the old BruteFIR CLI being the first example. The logic
modules can be used to achieve adaptive filtering. The new module
architecture will probably need some time to stabilize, and due to the
large amount of changes to the code, there is a great risk that this
new version is less stable than the last. A few details in the
configuration file format has changed as well, for which the
documentation has been updated. The documentation for how to program a
BruteFIR module is not yet available though.
<p><strong>2001-11-04</strong><br>
Added a todo list. Any suggestions are welcome of course.
<p><strong>2001-10-27</strong><br>
Added some quick and dirty benchmarks, and added some new
documentation. I made a low latency benchmark due to popular demand,
and the interesting result is that it is possible to get as low as
three milliseconds I/O delay, which is much lower than what I expected.
<p><strong>2001-09-27</strong><br>
New release, BruteFIR 0.96a. Some minor bugfixes, and at last
processor capability detection code has been included, so BruteFIR
will detect SSE or 3DNow, and use the optimized code accordingly.
<p><strong>2001-08-26</strong><br>
Updated documentation to cover all the new features of BruteFIR 0.96.
<p><strong>2001-08-20</strong><br>
BruteFIR 0.96 has been released, with a few important bugfixes, but
also much new features, which not yet has been documented here. It is
now possible to make filter networks, and have different length
on different filters.
<p><strong>2001-07-18</strong><br>
A new release, BruteFIR 0.95b, which contains an important bugfix is
available for download. It fixes a block bounds violation error when
converting from 32 bit integers to floating point. It also contains
some tuning of realtime priorities.
<p><strong>2001-06-10</strong><br>
Some minor updates to the documentation.
<p><strong>2001-06-03</strong><br>
A bugfix release, BruteFIR 0.95a, is available for download. It fixes
a bug which caused the program to crash when long filters in raw
format was read.
<p>
The documentation is now up to date again.
<p><strong>2001-05-26</strong><br>
New release, BruteFIR 0.95. This includes some new features, for
example support for changing delay in runtime and support for
non-interleaved sound cards. An important bug fix has also been
applied, when mixing files and sound cards for inputs/outputs trouble
could occur, but that should be fixed now.
<p>
Again, the documentation on this page is not entirely up to date with
the software itself.
<p><strong>2001-04-11</strong><br>
BruteFIR 0.94a released, which is a bugfix release. A severe bug in
the ALSA support code caused the error "Hardware does not support
enough fragments." with common sound cards. Now it is gone. Still
there is some work to do on the ALSA support code, like adding support
for cards with non-interleaved buffer layout (like the RME9652).
<p><strong>2001-04-08</strong><br>
Major changes and cleanups of this page has been done, and the source
code has been re-released. The new version is 0.94, and contains a new
improved convolution algorithm with hand-coded assembler optimizations
for Intel's SSE and AMD's 3Dnow. With this, BruteFIR is now capable of
even higher throughput.
<h2 id="docage">Note on the documentation's age</h2>
<p>
Note that the core of this documentation was written 1999 — 2001
and is thus old. It's up to date regarding how to configure BruteFIR,
but there's many references to old kernel versions and old CPUs
embedded in here.
<h2 id="whatis">What is it?</h2>
<p>
BruteFIR is a software convolution engine, a program for
applying long FIR filters to multi-channel digital audio, either
offline or in realtime. Its basic operation is specified through a
configuration file, and filters, attenuation and delay can be changed
in runtime through a simple command line interface. The FIR filter
algorithm used is an optimized frequency domain algorithm, partly
implemented in hand-coded assembler, thus throughput is extremely
high. In realtime, a standard computer can typically run more than 10
channels with more than 60000 filter taps each.
<p>
Through its highly modular design, things like adaptive filtering,
signal generators and sample I/O are easily added, extended and
modified, without the need to alter the program itself.
<p>
BruteFIR is free and open-source. It is licensed through the GNU
General Public License <a href="brutefir.html#gpl">[6]</a>.
<p>
The preferred operating system platform for the program is Linux
<a href="brutefir.html#linux">[11]</a>, but it is easily
ported to other Unixes as well, and supports for example FreeBSD out
of the box. BruteFIR uses the high-performance
FFTW library <a href="brutefir.html#fftw">[7]</a> for the Fast Fourier
Transform (FFT, <a href="brutefir.html#cooley_tukey">[5]</a>)
calculations, and ALSA, the Advanced Linux Sound
Architecture <a href="brutefir.html#alsa">[2]</a>, is the preferred
way of interfacing sound cards, although OSS, Open Sound System <a
href="brutefir.html#oss">[25]</a>, is supported as well. The main
features are:
<ul>
<li>Designed for realtime filtering of HiFi quality digital audio
<li>Up to 256 inputs and 256 outputs
<li>Input/output provided by external modules for maximum flexibility
<ul>
<li>Default I/O modules provide support for sound cards and files
<li>Access multiple I/O modules (= several sound cards / files) at the
same time
<li>8 - 24 bit audio at any rate supported by sound cards
<li>Easy-to-use C language API to create your own I/O modules, for
example to support more file formats, other sound card APIs, or
generate test signals
</ul>
<li>Mix/copy channels before and/or after filtering
<li>Cascade filters or build complex filter networks
<li>Simple C language API to create logic modules, to add new
functionality
<ul>
<li>Create your own logic module, for example to do adaptive filtering
<li>Provided is a logic module which implements a CLI accessible
through telnet to manage runtime settings, and a dynamic equalizer.
<li>Toggle/change filter in runtime
<li>Alter attenuation for each individual input and output in runtime
<li>Alter delay for each individual input and output in runtime
<li>Sub-sample delays are possible
</ul>
<li>Filter length limited only by processor power and memory
<li>Typical filter lengths are in the range 2048 - 262144 taps
<li>Reasonable low I/O-delay (typically 200 ms)
<li>Fixed I/O-delay, thus possible to sample-align with other processors
<li>Cross-fade for seamless filter coefficient changes.
<li>Re-dithering of outputs (HP TPDF)
<li>Overflow protection and monitoring
<li>32 or 64 bit floating point internal resolution.
<li>Supports multiple processors
</ul>
<h2 id="good">What is it good for?</h2>
<p>
A few examples of applications where BruteFIR could be a central component:
<ul>
<li>Digital crossover filters
<li>Room equalization
<li>Cross-talk cancellation
<li>Wavefield synthesis
<li>Auralization
<li>Ambiophonics
<li>Ambisonics
</ul>
Among these, room equalization and auralization needs the
longest FIR filters in the common case. Many applications can do with
quite short filters actually, but the thing is that you will probably
not need to compromise on the filter lengths when you use BruteFIR,
even when sample rates go up. However, BruteFIR is pretty useless by
itself, since it is only a FIR filter engine. It does not provide any
filter coefficients, thus it is not a filter design program. Also, due to
its relatively high I/O-delay, BruteFIR is most suited for
applications when the input signal is not live.
<p>
If you are interested in room equalization, my old NWFIIR project <a
href="brutefir.html#nwfiir">[18]</a> might be of interest. It's a bit
dated though. A better program for room equalization is Denis Sbragion's DRC <a
href="brutefir.html#drc">[22]</a>.
<h2 id="bruteconv">BruteFIR convolution</h2>
<p>
The main design goal of BruteFIR is to achieve as high throughput as
possible when filters are long (longer than 10000 taps). This
means that the filter algorithm must be very fast, since it will be
consuming almost all processor time of the whole program. BruteFIR's
convolution algorithm is an example of a situation where a
theoretically less efficient algorithm is faster in practice, because
it is easily optimized and hides performance problems of more complex
components.
<p>
Frequency domain algorithms for convolution is much faster than the
straight-forward time domain one when filters are long. The well known
overlap-save algorithm is used as the base in BruteFIR's
convolution. However, there are practical problems with this
algorithm as we will see.
<h3 id="bruteconv_1">The problem of complexity</h3>
<p>
Efficient convolution is done in the frequency domain and therefore
an FFT algorithm is needed. The FFT calculations occupy typically more than
90% of all processing time when plain overlap-save is
employed. Unfortunately, FFT it is not easy to implement. There exist
numerous implementations which vary greatly in performance, which is
one proof of the complexity. Since it takes up almost all processing
time, we must optimize it in order to make the convolution
faster. This leaves us with a quite hard optimization problem.
<p>
One way to optimize is to code assembler by hand and try to be better
than the compiler. Modern processors for personal computers like
Intel's Pentium III <a href="brutefir.html#intel">[10]</a> or AMD's
Athlon <a href="brutefir.html#amd">[1]</a> has custom SIMD instructions
(Single Instruction Multiple Data), which allows for a single
instruction to operate on more than one data element at a time. For
example, a single instruction may add together four or eight floating
point numbers. Typically, one can improve the performance of an
algorithm four times when using these instructions. They are not used
by common compilers like GCC (GNU Compiler Collection <a
href="brutefir.html#gcc">[9]</a>), meaning that we have a good opportunity
to write assembler code that will with a wide margin outperform code
generated by the compiler. Most FFT libraries are written in C, and
thus does not use these efficient SIMD instructions. So,
theoretically, we could implement an FFT algorithm using SIMD
instructions and beat the ones already available. However, we are
going for a simpler approach as we shall see. Since one of the
design goals of BruteFIR is to be fairly portable, we want to make any
assembler implementation small and simple, so it easily can be ported
to other processor architectures. Maybe 'small', but certainly not
'simple' would be applicable on an assembler implementation of FFT. In
conclusion, we find optimization with assembler as an attractive
method to increase performance of existing algorithms. However, the
algorithm we need to optimize, FFT, is quite complex and thus not an
attractive target for optimization.
<p>
One of the fastest FFT libraries
available is FFTW <a href="brutefir.html#fftw">[7]</a>, <a
href="brutefir.html#frigo_johnson">[8]</a>, which is used by
BruteFIR. There are more efficient FFT libraries out there (?), but
they are often limited to short lengths (typically less than 8192), or
are not free software nor open-source, which is a requirement of the
BruteFIR project.
<p>
<h3 id="bruteconv_2">Problems with long FFTs</h3>
Many of the fastest FFT implementations support only shorter filter
lengths (djbfft <a href="brutefir.html#djbfft">[3]</a> being one
example), and those that support long
lengths may behave poorly on some architectures. One example is FFTW
which on my 900 MHz AMD Athlon test system gets a large performance
dip when FFT lengths become larger than 32768
(real-valued transforms). On the test system, a 262144 point FFT is 30
times slower than a 32768 point, which theoretically should be only 10
times. Although the behavior is more stable on my 550 MHz Pentium III
test system, performance drops more than O(n * log2(n)) which is the
complexity of the FFT algorithm. Note that these tests were performed
using FFTW2.
<p>
These performance problems is of course due to memory accesses, and
poor cooperation between the hardware caching architecture and the
software. When the data of the algorithm exceeds the cache size, the
problem becomes obvious.
<p>
Both Pentium and Athlon architectures allows for giving the cache
hints from the software to reduce problems in these situations, but
this must be done in assembler, and is therefore seldom used.
<p>
Apart from performance problems, long FFTs include more
multiplications and scalings which induces a larger quantization
error. This is however a minor problem (?).
<h3 id="bruteconv_3">Partitioned convolution</h3>
<p>
We have seen that the central algorithm of fast convolution, the Fast
Fourier Transform, is complex to implement and optimize. We have also
seen that the need of long FFTs reduces the choices of available
implementations and that the existing can behave poorly on some
hardware architectures. A modified fast convolution algorithm that
uses shorter FFTs, and where most time is spent in code which is
small and easily optimized, would be ideal.
<p>
Many have worked on improving the standard frequency domain
convolution algorithms for different purposes. The central idea found
in many of these improvements, is that the impulse response, that is
the filter, is partitioned into several smaller parts. When each
part is filtered with the input, the results delayed suitably and finally
added together, one gets the same result as when processing the whole
filter at once. As far as I know, the earliest user of this simple but
powerful concept is T.G. Stockham <a
href="brutefir.html#stockham">[16]</a>, who published his results only
one year after the famous Cooley and Tukey FFT paper <a
href="brutefir.html#cooley_tukey">[5]</a>. The concept can
be used to solve several problems. Stockham used it for saving memory,
but in later work made in the eighties and early nineties, at the time
when realtime DSP became feasible for the first time, it was stated
that it can also be used to reduce quantization errors, reduce
I/O-delay, and adapt to optimal FFT lengths of a specific
implementation. All these improvements are described by J.S. Soo and
K.K. Pang <a href="brutefir.html#soo_pang_1">[14]</a>, <a
href="brutefir.html#soo_pang_2">[15]</a>. Other realtime partitioned
convolution pioneers are B.D. Kulp <a
href="brutefir.html#kulp">[17]</a>, P.C.W. Sommen <a
href="brutefir.html#sommen_1">[12]</a>, <a
href="brutefir.html#sommen_2">[13]</a> and J.M.P. Borrallo and
M. G. Otero <a href="brutefir.html#borrallo_otero">[4]</a>. Their
work is a good place to start reading for the one interested in
getting a more detailed description of partitioned convolution. The
convolution algorithm in BruteFIR is conceptually exactly the same as
the one found in these papers.
<p>
When partitioned convolution is used, something interesting happens in
the processing time distribution of the algorithm. The major part of
processing is moved from the FFT algorithm, to the trivial operation
of convolution in the frequency domain which is simply
multiplication. The more parts we split the impulse response into, the
more convolution and less FFT is done. Naturally the FFTs get shorter,
and thus we get rid of the problems associated to long FFTs. We now
realize that partitioned convolution is the answer to our wishes, we
do not need long FFTs and it becomes less important to optimize the
FFT algorithm.
<h3 id="bruteconv_4">Optimizing where it counts</h3>
<p>
We notice that we will earn most from optimizing the operation where a
segment of input converted to the frequency domain is multiplied with
the corresponding part of the filter also in the frequency
domain. The result is then added to the output. When the data format
is half-complex, a format used by most real-valued FFTs, The straight-forward
implementation look like this when programmed in C:
<p>
<pre>
d[0] += b[0] * c[0];
for (n = 1; n < n_fft / 2; n++) {
d[n] += b[n] * c[n] - b[n_fft - n] * c[n_fft - n];
d[n_fft - n] += b[n] * c[n_fft - n] + b[n_fft - n] * c[n];
}
d[n] += b[n] * c[n];
</pre>
<p>
<code>b</code> is the input, <code>c</code> is the filter coefficients, and
<code>d</code> is the output. As we see, this is a very short and simple
algorithm, which is easy to implement in assembler. There are a couple
of problems though. The data in each array is accessed from the tail
and the front at the same time. It would be better for the cache to
localize the accesses, and move from front to end only. It is also a
problem that the data is accessed both in forward and reverse order
(both 0,1,2,3 and 3,2,1,0), since we want to used SIMD
instructions. To solve the problem, we need to reorder the data. This
will only be necessary to do once with the filter coefficients, so it is
free. For the input however, we need to do this once after each forward
transform, and for the output we need to restore the half-complex
order prior to each inverse transform. In BruteFIR the input reordering
is put into the mixing and scaling step, and the output reordering in
the quantization step, so the cost is next to nothing. Below is a
C implementation of the previous algorithm, when data has been
reordered to better fit SIMD instructions and to improve the memory
access pattern:
<p>
<pre>
d1s = d[0] + b[0] * c[0];
d2s = d[4] + b[4] * c[4];
for (n = 0; n < n_fft; n += 8) {
d[n+0] += b[n+0] * c[n+0] - b[n+4] * c[n+4];
d[n+1] += b[n+1] * c[n+1] - b[n+5] * c[n+5];
d[n+2] += b[n+2] * c[n+2] - b[n+6] * c[n+6];
d[n+3] += b[n+3] * c[n+3] - b[n+7] * c[n+7];
d[n+4] += b[n+0] * c[n+4] + b[n+4] * c[n+0];
d[n+5] += b[n+1] * c[n+5] + b[n+5] * c[n+1];
d[n+6] += b[n+2] * c[n+6] + b[n+6] * c[n+2];
d[n+7] += b[n+3] * c[n+7] + b[n+7] * c[n+3];
}
d[0] = d1s;
d[4] = d2s;
</pre>
<p>
The above function is easily converted into assembler using Intel's
SSE instructions, or AMD's 3Dnow instructions, with cache hint
instructions. The key loop (which is unrolled to further improve
performance) becomes less than 50 lines long.
<p>
It is interesting that partitioned convolution makes much more memory
references than ordinary overlap-save. In the most simple algorithm
analysis, only the number of mathematical operations (like
multiplications and additions) are considered when evaluating
performance. Better analysis also counts the number of memory
references, but unfortunately that is not enough considering the
modern computer architecture; it is also of profound importance to
take <em>how</em> the accesses are done into consideration. One bad
reference can be worse in terms of performance than ten good ones on a
modern computer.
<h3 id="bruteconv_5">Conclusion</h3>
<p>
By implementing partitioned convolution we have avoided the need of
using long FFTs, and moved the major part of the processing time from
the FFT to a simple multiplication loop. By reordering data after the forward
transform and restoring it prior to inverse transform, the
multiplication loop can be easily realized with SIMD instructions, and
thus become very efficient. On the 900 MHz AMD Athlon test system,
filtering of a 131072 tap long filter is twice as fast when 16
partitions of 8192 taps each are used instead of a single
partition (note: this test case is exceptional, the performance improvement
is less in the common case). This despite the new algorithm uses more
memory references and more mathematical operations.
<p>
Apart from the improvement in throughput, we also get lower I/O-delay
(equals about twice the partition length), lower memory consumption,
and more flexible filter length options. A 140000 tap filter would
require a 262144 tap filter if ordinary overlap-save was used, but
with partitioned convolution we can use 18 partitions of 8192 taps,
and then get a gross performance improvement, coupled with delay
reduction.
<p>
Still, one must not over-estimate partitioned convolution. If there really
is an optimal FFT algorithm available, ordinary overlap-save will
certainly outperform the partitioned algorithm. An example of an
assembler-optimized FFT algorithm can be found in the non-free and
non-portable Intel Native signaling processing library <a
href="brutefir.html#nsp">[19]</a>.
<h2 id="download">Where can I get it?</h2>
<p>
You are free to <a href="files/brutefir-1.0o.tar.gz">download version
1.0o</a>.
<p>
The package contains the source-code, you will need a supported
platform to run it on (Linux is recommended, but FreeBSD or Solaris
should work out of the box too, it is not as closely maintained
though). Apart from the basic stuff you must also have FFTW3 installed
(note that FFTW2, as used by old versions of BruteFIR, won't
work). FFTW3 must be compiled for both double and single precision.
<p>
If you want sound card support, it is recommended to use ALSA on Linux
platforms, and when that is not available, OSS can be used.
<p>
If you want to use the JACK support, you need an up to date version of
JACK installed.
<p>
Be sure that you use an official GCC compiler when compiling
BruteFIR. One user reported bad sound quality (noise artifacts in the
BruteFIR output), and it was shown that he had used GCC 2.96 (not an
official version), that caused errors in the floating point
calculations of BruteFIR.
<p>
The package does not yet contain configure scripts or other nice
things to make compiling easier. However, with some luck it should
work simply by typing 'make'. You can also view the Makefile to see
what compile options there are. If you have any questions, just mail
me, <a href="mailto:[email protected]">[email protected]</a>.
<h2 id="howfast">How fast is it?</h2>
<p>
BruteFIR's main feature is that is fast. It's brutally fast. The key
component making BruteFIR fast is the convolution algorithm described
above.
<p>
<strong>Note:</strong> the test descriptions here are a bit dated, made
using an old version of BruteFIR. However, the results should provide
a rough idea of what BruteFIR can do in terms of throughput. The
example configuration files have been updated to work with the current
version.
<h3 id="throughput">How high throughput can I get?</h3>
<p>
With a <a href="massive_config">massive convolution
configuration file</a> setting up
BruteFIR to run 26 filters, each 131072 taps long, each connected to
its own input and output (that is 26 inputs and outputs), meaning a
total of 3407872 filter taps, a 1 GHz AMD Athlon with 266 MHz DDR RAM
gets about 90% processor load, and can successfully run it in real
time. The sample rate was 44.1 kHz, BruteFIR was compiled with 32 bit
floating point precision, and the I/O delay was set to 375 ms. The
sound card used was an RME Audio Hammerfall.
<h3 id="lowdelay">How low I/O delay can I get?</h3>
<p>
BruteFIR is mainly designed for high throughput, not low
delay. However, there is an interest of using BruteFIR for low delay
convolution anyway, so here are some benchmarks so you know what to
expect. Partitioned convolution can indeed allow for quite low delay,
very low if the processing power is available, and the filters are not
too long.
<p>
Below is an example of a simple cross-talk cancellation
application running on a 1 GHz AMD Athlon with 266 MHz DDR RAM and an
RME Audio Hammerfall sound card. You can <a
href="xtc_config">download the cross-talk cancellation
configuration file</a> that was used if you want to test
yourself. There are only four filters and their length are no more
than 8192 taps (note: the example files included in the package are
only 4096 taps long, as seen in the updated example configuration
file), so it is indeed a very light application, which is a
requirement if you want very low delay, since partitioned convolution
does not scale very well with low delays (meaning a large number of
partitions). The sample rate in these tests is 44.1 kHz, and BruteFIR
was running with 32 bit floating point precision.
<p>
<table>
<tr>
<th>
delay in ms
</th>
<th>
processor load
</th>
<th>
partition size
</th>
<th>
number of partitions
</th>
</tr>
<tr>
<td>
3 ms
</td>
<td>
60%
</td>
<td>
64 samples
</td>
<td>
128
</td>
</tr>
<tr>
<td>
6 ms
</td>
<td>
30%
</td>
<td>
128 samples
</td>
<td>
64
</td>
</tr>
<tr>
<td>
12 ms
</td>
<td>
16%
</td>
<td>
256 samples
</td>
<td>
32
</td>
</tr>
<tr>
<td>
24 ms
</td>
<td>
11%
</td>
<td>
512 samples
</td>
<td>
16
</td>
</tr>
<tr>
<td>
47 ms
</td>
<td>
8%
</td>
<td>
1024 samples
</td>
<td>
8
</td>
</tr>
</table>
<p>
As seen in the table, BruteFIR allows for as low delay as 3
milliseconds, which is the limit of the sound card used, which cannot have
shorter than 64 sample partitions.
<p>
If you want to run BruteFIR to
achieve high throughput, you should expect to have a delay of at least
100 ms though (and using no more than 16 partitions or so).
<p>
If you try to run BruteFIR with shorter delay than the computer can
handle, or with too long filters, the program will exit with a broken
pipe signal. If you get broken pipe only after a while, this is
probably due to that you have not applied a good low latency patch to
the kernel (there are bad ones as well), or you have cron jobs running
or other software that competes for using the processor. For
reasonable low latency, a low latency kernel can handle other
processes running, but for as low as 3 milliseconds like in this
example, you should have a dedicated clean system for running BruteFIR.
<h2 id="hardware">Hardware considerations</h2>
<p>
<strong>Note:</strong> the hardware referenced here is a bit dated (a
long time ago the text was written), but apart from that, the text is
up to date.
<p>
What is important for BruteFIR is that the machine has fast memory and
fast processor. A Pentium 4 with its RDRAM is probably the best choice
today. However, an Athlon with DDR RAM is not bad either, and
significantly cheaper. A fast processor on a computer with slow memory
is what most often causes disappointment. For example, a dual Pentium
III at 1 GHz with good use of both processors was found to be slower
than a single processor 1 GHz AMD Athlon with DDR RAM. The problem was
that the Pentium III had poor memory performance. The stream benchmark <a
href="brutefir.html#stream">[20]</a> is a good program to use to verify
the memory bandwidth if you think you get poor BruteFIR performance.
<p>
If you use SDRAM you will never get exceptional memory
bandwidth, however, some tuning of timer settings in the BIOS, or
overclocking of the memory bus can give you quite decent performance.
<p>
When it comes to sound hardware, you should be able to use any card
that is compatible with ALSA <a
href="brutefir.html#alsa">[2]</a>. However, it is not very likely that
the sound card code of BruteFIR will work for all sound cards
supported by ALSA, although that is the goal. If you get problems with
your sound card, please send me a mail, and I will do my best to get
it to work, or even better, try to get it to work yourself and send me
a patch.
<p>