Many of today's desktop systems and servers come with on-board gigabit network controllers. After a few simple speed tests you will soon find that you are not able to transfer data over the network much faster than you did with a 100 Mbit link. Many factors affect network performance, including hardware, operating system and network stack options. The purpose of this page is to explain how you can achieve up to 930 megabits per second transfer rates over a gigabit link using OpenBSD as a firewall or transparent bridge.
It is important to remember that you can not expect to reach gigabit speeds using slow hardware or an unoptimized firewall rule set. Speed and efficiency are key to our goal. Let's start with the most important part of the equation: hardware.
No matter what operating system you choose, the machine you run on will determine the theoretical speed limit you can expect to achieve. When people talk about how fast a system is they always mention CPU clock speed. We would expect an AMD64 at 2.4GHz to run faster than a Pentium 3 at 1.0GHz, but CPU speed is not the key; motherboard bus speed is.
In terms of a firewall or bridge we are looking to move data through the system as fast as possible. This means we need to have a PCI bus that is able to move data quickly between network interfaces. To do this the machine must have a wide bus and high bus speed. CPU clock speed is a very minor part of the equation.
The quality of a network card is key to high throughput. As a very general rule, using the on-board network card is going to be much slower than an add-in PCI card. The reason is that most desktop motherboard manufacturers use cheap on-board network chip sets that use CPU processing time instead of handling TCP traffic by themselves. This leads to very slow network performance and high CPU load.
A gigabit network controller built on board that relies on the CPU will slow the entire system down. More than likely the system will not even be able to sustain 100 Mbit speeds while also pegging the CPU at 100%. A network controller that is able to negotiate at gigabit speed is _very_ different from a controller that can transfer a gigabit of data per second.
Ideally you want to use a server-grade add-on card with a TCP offload engine or TCP accelerator. We have seen very good speeds with the Intel Pro/1000 MT series (em4) cards. They are not too expensive and all operating systems support them.
Not to say that all on-board chip sets are bad. Supermicro uses an Intel 82546EB Gigabit Ethernet Controller on their server motherboards. It offers two(2) copper gigabit ports through a single chip set on a 133MHz, 64 bit wide PCI-X bus, pre-fetches up to 64 packet descriptors and has two 64KB on-chip packet buffers. This is an exceptionally fast chip and it saves space by being built onto the server board.
Now, in order to move data in and out of the network cards as fast as possible we need a bus with a wide bit rate and a high clock speed. For example, a 64bit PCI-X slot is wider than a 32bit slot, just as a 66MHz bus is faster than a 33MHz bus. Wide is good, fast is good, but wide and fast are better.
The equation to calculate the theoretical speed of a PCI or PCI-X slot is the following:
(bus speed in MHz) * (bus width in bits) / 8 = speed in Megabytes/second

66 MHz * 32 bit / 8 = 264 Megabytes/second
For example, if we have a motherboard with a 32bit wide bus running at 66MHz then the theoretical max speed we can push data through the slot is 66*32/8= 264 Megabytes/second. With a server class board we could use a 64bit slot running at 133MHz and reach speeds of 133*64/8= 1064 Megabytes/second.
Now that you have the max speed of a single PCI slot we need to understand that this number represents the max speed of the bus if nothing else is using it. Since all PCI cards and built on-board chips share the same bus, they must also be taken into account. If we have two network cards each using a 64bit, 133MHz slot then each slot will get to use 50% of the total speed of the PCI bus. Each card can do 133*64/8= 1064 Megabytes/second, and if both network cards are being used at once, like on a firewall, then each card can use at most 1064/2= 532 Megabytes/second. This is still well above the maximum speed of a gigabit connection, which can move 1000/8= 125 Megabytes/second.
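The bus arithmetic above can be sketched as a small shell script; the helper name `bus_mbytes` is our own, but the numbers match the worked examples:

```shell
#!/bin/sh
# Theoretical bus throughput: (bus MHz * bus width in bits) / 8 = MBytes/sec.
bus_mbytes() {
    mhz=$1; bits=$2
    echo $(( mhz * bits / 8 ))
}

echo "PCI   66MHz/32bit : $(bus_mbytes 66 32) MB/s"             # 264
echo "PCI-X 133MHz/64bit: $(bus_mbytes 133 64) MB/s"            # 1064

# Two cards sharing the same 133MHz/64bit bus each get half the total.
echo "Per card, 2 cards : $(( $(bus_mbytes 133 64) / 2 )) MB/s" # 532

# A saturated gigabit link only needs 1000/8 MB/s, well under 532.
echo "Gigabit link      : $(( 1000 / 8 )) MB/s"                 # 125
```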
Look at the specifications of the motherboard you expect to use and apply the above equation to get a rough idea of the speeds you can expect out of the box. Hardware speed is the key to a fast firewall. Before setting up your new system and possibly wasting hours wondering why it is not reaching your speed goals, make sure you understand the limitations of the hardware. Do not expect throughput out of your system hardware that it is _not_ capable of.
For example, when using a four port network card in a machine, consider the bandwidth of the adapter slot you put it into. Standard PCI is a 32 bit wide interface with a bus speed of 33MHz or 66MHz, and this bandwidth is shared across all devices on the same bus. PCI-e is a serial connection clocked at 2.5GHz per lane, giving an effective maximum bandwidth of about 2Gbps in each direction for a 1x slot. So, if you decide to support four 1Gbps connections on one card it might be best to do it with a PCI-e 4x or faster slot and card.
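A rough sketch of the PCI-e arithmetic, assuming PCI-e 1.x signaling at 2.5GHz per lane with 8b/10b encoding (so 80% of the raw rate carries data):

```shell
#!/bin/sh
# PCI-e 1.x: 2.5 Gbit/s raw per lane per direction; 8b/10b encoding
# leaves 80% of that as usable data, i.e. 2.0 Gbit/s per lane.
per_lane_mbit=$(( 2500 * 8 / 10 ))
echo "x1 slot: ${per_lane_mbit} Mbit/s per direction"           # 2000
echo "x4 slot: $(( per_lane_mbit * 4 )) Mbit/s per direction"   # 8000
# A four port gigabit card needs up to 4000 Mbit/s, so an x4 slot fits.
```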
For a standard OpenBSD firewall, one(1) gigabyte of ram is more than enough. In fact, unless you are running many memory hungry services you will actually use less than 100 megabytes of ram at any one time. On our testing system we had eight(8) gigabytes available, but OpenBSD only recognized 3.1 gigabytes of it whether we used the i386 or AMD64 kernel. One of the few times you may need more ram is if your firewall is going to load tables in Pf with tens of thousands of entries. These days ram is cheap, but there is no need to put four(4) to eight(8) gigabytes in the machine as it will only go to waste.
It is sometimes recommended to set the MTU of your network interface above the default value of 1500. Users of jumbo frames can set the MTU as high as 9000. The MTU value tells the network card to send Ethernet frames of the specified size in bytes. While this may be useful when connecting two hosts directly together using the same MTU, it is a lot less useful when connecting through a switch which does not support a larger MTU.
When a switch or a machine receives a frame larger than it is able to forward, it must fragment the packet. This takes time and is very inefficient. The throughput you may gain when connecting to similar high-MTU machines you will lose when connecting to any 1500-MTU machine.
Either way, increasing the MTU may not be necessary. 930Mb/s can be attained at the normal 1500 byte MTU setting with the following network tweaks.
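One way to see the fragmentation limit in practice is the classic ping payload calculation: an unfragmented echo must fit in the MTU minus the 20 byte IP header and 8 byte ICMP header. A small sketch (the host name is a placeholder, and the ping flags in the comment are the Linux spelling; other systems differ):

```shell
#!/bin/sh
# Largest ICMP payload that fits in one frame without fragmentation:
# MTU - 20 (IP header) - 8 (ICMP header).
mtu=1500
payload=$(( mtu - 20 - 8 ))
echo "max unfragmented ping payload: ${payload} bytes"   # 1472
# On Linux, verify with the don't-fragment bit set:
#   ping -M do -s 1472 some.host.example
```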
First, make sure you are running the latest version of OpenBSD. Not necessarily the bleeding edge -current tree; the -stable tree will work just fine. As of OpenBSD v4.5 there has been a lot of work done to remove many of the bottlenecks in the network code and in how Pf handles traffic.
Second, make sure you have applied any patches to the system according to the OpenBSD page. We have a patch guide if you need it: Patching OpenBSD kernel and packages.
The following options are put in the /etc/sysctl.conf file. They will increase the network buffer sizes and allow TCP window scaling. Understand that these settings are at the upper extreme. We found them perfectly suited to a production environment which can saturate a gigabit link. You may not need to set each of the values this high; that is up to your environment and testing methods. A summary explanation follows each option.
### Calomel.org OpenBSD /etc/sysctl.conf ##
ddb.panic=0                     # do not enter ddb console on kernel panic, reboot if possible
net.inet.ip.forwarding=1        # permit forwarding (routing) of packets
net.inet.ip.ifq.maxlen=512      # maximum allowed input queue length (256*number of interfaces)
net.inet.icmp.errppslimit=1000  # maximum number of outgoing ICMP error messages per second
net.inet.ip.ttl=254             # the TTL should match the "min-ttl" in the scrub rule in pf.conf
net.inet.tcp.ackonpush=1        # ACKs for packets with the push bit set should not be delayed
net.inet.tcp.ecn=1              # Explicit Congestion Notification enabled
net.inet.tcp.mssdflt=1452       # maximum segment size (1452 from the scrub rule in pf.conf)
net.inet.tcp.rfc1323=1          # RFC1323 TCP window scaling
net.inet.tcp.recvspace=262144   # increase the TCP "receive" window size to increase performance
net.inet.tcp.sendspace=262144   # increase the TCP "send" window size to increase performance
net.inet.tcp.sack=1             # enable TCP Selective ACK (SACK) packet recovery
net.inet.udp.recvspace=262144   # increase the UDP "receive" buffer size to increase performance
net.inet.udp.sendspace=262144   # increase the UDP "send" buffer size to increase performance
kern.maxclusters=128000         # cluster allocation limit
vm.swapencrypt.enable=1         # encrypt pages that go to swap

### CARP options if needed
# net.inet.carp.arpbalance=0    # CARP load-balance
# net.inet.carp.log=2           # log CARP state changes
# net.inet.carp.preempt=1       # enable CARP interfaces to preempt each other (0 -> 1)
# net.inet.ip.forwarding=1      # enable packet forwarding through the firewall (0 -> 1)
You can apply each of these settings manually by using sysctl on the command line. For example, "sysctl kern.maxclusters=128000" will set the kern.maxclusters variable until the machine is rebooted. By setting the variables manually you can test each of them to see if they will help your machine.
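A minimal test cycle for one of the values above might look like this (OpenBSD commands, run as root):

```shell
# Check the current value first.
sysctl net.inet.tcp.recvspace

# Set it for this boot only; /etc/sysctl.conf is untouched.
sysctl net.inet.tcp.recvspace=262144

# Re-run your benchmark, then keep the line in /etc/sysctl.conf
# only if the numbers actually improve.
```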
Continuing with OpenBSD v4.5, a lot of work has been done on the single and multi core kernels focused on speed and efficiency improvements. Since many OpenBSD machines will be used as a firewall or bridge, we wanted to see what type of speeds we could expect passing through the machine. Let's take a look at the single and multi core kernels, the effect of having PF enabled or disabled, and the effect of our "speed tweaks" listed in the section above.
The testing hardware
To do our testing we will use the latest patches applied to the latest distribution. Our test setup consists of two(2) identical boxes containing an Intel Core 2 Quad (Q9300), eight(8) gigs of ram and an Intel PRO/1000 MT (CAT5e copper) network card. The cards were put in a 64bit PCI-X slot running at 133 MHz. The boxes are connected to each other by an Extreme Networks Summit X450a-48t gigabit switch using 12' unshielded CAT6 cable.
The testing software
The following iperf options were used on the machines we will call test0 and test1. We will sustain a full speed transfer for 30 seconds and take the average speed in Mbits/sec as the result. Iperf is available through the OpenBSD repositories using "pkg_add iperf".

## iperf listening server
root@test1: iperf -s

## iperf sending client
root@test0: iperf -i 1 -t 30 -c test1
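If a single TCP stream cannot fill the link, iperf can also drive several connections at once with its -P flag; a sketch of the client side:

```shell
# Four parallel client streams for 30 seconds, reporting every second.
iperf -i 1 -t 30 -P 4 -c test1
```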
The PF rules
The following minimal PF rules were used if PF was enabled (pf=YES):

# pfctl -sr
scrub in all fragment reassemble
pass in all flags S/SA keep state
block drop in on ! lo0 proto tcp from any to any port = 6000
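To reproduce a setup like this, the rules go in /etc/pf.conf and are loaded and inspected with pfctl; a typical cycle (run as root) looks like:

```shell
# Syntax check the rule file without loading it.
pfctl -nf /etc/pf.conf

# Load the rules and enable PF.
pfctl -f /etc/pf.conf
pfctl -e

# Show the loaded rules and the filter statistics.
pfctl -sr
pfctl -si
```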
Test 1: No Speed Tweaks. Using the GENERIC and GENERIC.MP kernels (patched -stable) with the default TCP window sizes we are able to sustain over 300 Mbits/sec (37 Megabytes/sec). Since the link was gigabit (1000 Mbits/sec maximum) we are using less than 40% of our network line speed.
bsd.single_processor_patched pf=YES speed_tweaks=NO
 [ 1] 0.0-30.0 sec  1.10 GBytes  315 Mbits/sec

bsd.single_processor_patched pf=NO  speed_tweaks=NO
 [ 1] 0.0-30.0 sec  1.24 GBytes  356 Mbits/sec

bsd.multi_processor_patched  pf=YES speed_tweaks=NO
 [ 4] 0.0-30.2 sec  1.13 GBytes  321 Mbits/sec

bsd.multi_processor_patched  pf=NO  speed_tweaks=NO
 [ 4] 0.0-30.0 sec  1.28 GBytes  368 Mbits/sec
According to the results the network utilization was quite poor. We are able to push data across the network at less than half of its capacity (gigabit=1000Mbit/s and we used 368Mbit/s, or about 37%). For most uses on a home network with a cable modem or FIOS you will not notice. But what if you have access to a high speed gigabit or 10 gigabit network?
Test 2: Calomel.org Speed Tweaks. Using the GENERIC and GENERIC.MP (patched -stable) kernel we are able to sustain around 800 Mbits/sec, almost three(3) times the default speeds.
bsd.single_processor_patched pf=YES speed_tweaks=YES
 [ 1] 0.0-30.0 sec  2.95 GBytes  845 Mbits/sec

bsd.single_processor_patched pf=NO  speed_tweaks=YES
 [ 1] 0.0-30.0 sec  3.25 GBytes  868 Mbits/sec

bsd.multi_processor_patched  pf=YES speed_tweaks=YES
 [ 4] 0.0-30.0 sec  2.69 GBytes  772 Mbits/sec

bsd.multi_processor_patched  pf=NO  speed_tweaks=YES
 [ 4] 0.0-30.2 sec  2.82 GBytes  803 Mbits/sec
These results are much better. We are utilizing more than 80% of a gigabit network, which means we can sustain over 100 megabytes per second. Both the single processor and multi processor kernels performed efficiently. The use of PF reduced our throughput only minimally.
As of OpenBSD v4.5 you are welcome to use either one. Both kernels performed exceptionally well in our speed tests.
Despite the recent development of multiple processor support in OpenBSD, the kernel still operates as if it were running on a single processor system. On an SMP system only one processor is able to run the kernel at any point in time, a semantic which is enforced by a Big Giant Lock. The Big Giant Lock (BGL) works like a token: if the kernel is being run on one CPU then that CPU holds the BGL, and the kernel can _not_ be run on a second CPU at the same time. The network stack, and thus PF and pfsync, run in the kernel and so under the Big Giant Lock.
If you have access to a multi core machine and expect to run programs that will take advantage of the cores, then the multi core kernel is a good choice. PF is _not_ a multi core program, so it will not benefit from the multi core kernel. The advantage you will see is when other programs need to run in parallel with PF; for example an intrusion detection app, a monitoring script or a real time network reporting tool.
The next few sections are dedicated to operating systems other than OpenBSD. Each OS has some way in which you can increase the overall throughput of the system. Just scroll to the OS you are most interested in.
### Calomel.org FreeBSD /etc/sysctl.conf ##
kern.ipc.maxsockbuf=262144    # maximum socket buffer (window) size
net.inet.tcp.sendspace=65536  # increase the TCP window size to increase performance
net.inet.tcp.recvspace=65536  # "
net.inet.tcp.rfc1323=1        # RFC1323 TCP window scaling
kern.ipc.nmbclusters=32768    # network mbuf clusters
### Calomel.org RedHat or CentOS Linux /etc/sysctl.conf ##
# Some of the defaults may be different for your kernel. Apply this file
# with "sysctl -p". These are just suggested values that worked well to
# increase throughput in several network benchmark tests.

### IPV4 specific settings
# turn TCP timestamp support off, default 1, reduces CPU use
net.ipv4.tcp_timestamps = 0
# turn SACK support on -- you probably want this off for 10GigE
net.ipv4.tcp_sack = 1
# enable RFC1323 TCP window scaling
net.ipv4.tcp_window_scaling = 1
# on systems with a VERY fast bus to memory interface this is the big plus
# sets min/default/max TCP read buffer, default 4096 87380 174760
# setting to 100M - 10M is too small for cross country (chsmall)
net.ipv4.tcp_rmem = 1000000 1000000 1000000
# sets min/pressure/max TCP write buffer, default 4096 16384 131072
net.ipv4.tcp_wmem = 1000000 1000000 1000000
# sets min/pressure/max TCP buffer space, default 31744 32256 32768
net.ipv4.tcp_mem = 150000000 150000000 150000000

### CORE settings (affect sockets and UDP)
# maximum receive socket buffer size, default 131071
net.core.rmem_max = 1000000
# maximum send socket buffer size, default 131071
net.core.wmem_max = 1000000
# default receive socket buffer size, default 65535
net.core.rmem_default = 2524287
# default send socket buffer size, default 65535
net.core.wmem_default = 2524287
# maximum amount of option memory buffers, default 10240
net.core.optmem_max = 2524287
# number of unprocessed input packets before kernel starts dropping them, default 300
net.core.netdev_max_backlog = 300000
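After editing /etc/sysctl.conf on a Linux box, the settings can be applied and spot-checked like this (run as root):

```shell
# Apply every setting in /etc/sysctl.conf.
sysctl -p

# Spot check a few of the values.
sysctl net.ipv4.tcp_rmem
sysctl net.core.rmem_max
```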
### Calomel.org Suse or openSUSE Linux /etc/sysctl.conf ##
# Some of the defaults may be different for your kernel. Apply this file
# with "sysctl -p". These are just suggested values that worked well to
# increase throughput in several network benchmark tests.

# packet reordering in a network can be interpreted as packet loss;
# increasing the value of this parameter should improve performance
net.ipv4.tcp_reordering = 20
# sets the min/default/max socket send buffer for TCP
net.ipv4.tcp_wmem = 8192 87380 16777216
# sets the min/default/max socket receive buffer for TCP
net.ipv4.tcp_rmem = 8192 87380 16777216
# do not cache TCP performance metrics from previous connections
net.ipv4.tcp_no_metrics_save = 1
# you can set this to one of the many available high speed congestion
# control variants such as "cubic" or "highspeed"
net.ipv4.tcp_congestion_control = cubic
# sets the maximum socket send buffer for all protocols
net.core.wmem_max = 16777216
# sets the maximum socket receive buffer for all protocols
net.core.rmem_max = 16777216
Edit the registry using "regedit" and look for the following section:
Now add the following values:
Finally, one last note for Windows XP users: when you install Service Pack 2 (SP2), make sure to disable "Internet Connection Sharing". It is a major network slow down, and disabling it should fix the performance problem. Also make sure you turn off or remove QOS in the TCP/IP network settings.
How can I find performance bottlenecks and display real time statistics about the firewall hardware?
On any Unix based system, run the command "systat vmstat" to get a top-like display of memory totals, paging amounts, swap numbers, interrupts per second and much more. Systat is incredibly useful for determining where the performance bottleneck is on a machine.
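A few other stock tools are worth knowing alongside systat; these are standard BSD commands, though availability and flags vary slightly by system:

```shell
# Live view of CPU, memory, paging and interrupts.
systat vmstat

# One-shot table of per-interface packet and error counters.
netstat -i

# Interface traffic sampled every second.
netstat -w 1

# Interrupt counts per device, useful to spot a busy network card.
vmstat -i
```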
Questions, comments, or suggestions? Contact Calomel.org