Add strict checking for netlink KRT dumps to avoid PMTU cache records
from FNHE table dump along with KRT.
Linux Kernel added FNHE table dump to the netlink API in patch:
8d3b68cd37.1561131177.git.sbrivio@redhat.com/
Therefore, since Linux 5.3 these route cache entries are dumped together
with regular routes during periodic KRT scans, which in some cases may be
huge amount of useless data. This can be avoided by using strict checking
for netlink dumps:
https://lore.kernel.org/netdev/20181008031644.15989-1-dsahern@kernel.org/
The patch mitigates the risk of receiving unknown and potentially large
number of FNHE records that would block BIRD I/O in each sync. There is a
known issue caused by the GRE tunnels on Linux that seems to be creating
one FNHE record for each destination IP address that is routed through
the tunnel, even when the PMTU equals to GRE interface MTU.
Thanks to Tomas Hlavacek for the original patch.
Kernel uses cloned routes to keep route cache entries, but reports them
together with regular routes. They were skipped implicitly as they
do not have rtm_protocol filled. Add explicit check for cloned flag
and skip such routes explicitly.
Also, improve debug logs of skipped routes.
Add option to socket interface for nonlocal binding, i.e. binding to an
IP address that is not present on interfaces. This behaviour is enabled
when SKF_FREEBIND socket flag is set. For Linux systems, it is
implemented by IP_FREEBIND socket flag.
Minor changes done by commiter.
Currently, BIRD ignores dead routes to consider them absent. But it also
ignores its own routes and thus it can not correctly manage such routes
in some cases. This patch makes an exception for routes with proto bird
when ignoring dead routes, so they can be properly updated or removed.
Thanks to Alexander Zubkov for the original patch.
Lexer expression for bytestring was too loose, accepting also
full-length IPv6 addresses. It should be restricted such that
colon is used between every byte or never.
Fix the regex and also add some test cases for it.
Thanks to Alexander Zubkov for the bugreport
Add operators .min and .max to find minumum or maximum element in sets
of types: clist, eclist, lclist. Example usage:
bgp_community.min
bgp_ext_community.max
filter(bgp_large_community, [(as1, as2, *)]).min
Signed-off-by: Alexander Zubkov <green@qrator.net>
The BSD kernel does not support the onlink flag and BIRD does not use
direct routes for next hop validation, instead depends on interface
address ranges. We would like to handle PtMP cases with only host
addresses configured, like:
ifconfig wg0 192.168.0.10/32
route add 192.168.0.4 -iface wg0
route add 192.168.0.8 -iface wg0
To accept BIRD routes with onlink next-hop, like:
route 192.168.42.0/24 via 192.168.0.4%wg0 onlink
BIRD would dismiss the route when receiving from the kernel, as the
next-hop 192.168.0.4 is not part of any interface subnet and onlink
flag is not kept by the BSD kernel.
The commit fixes this by assuming that for routes received from the
kernel, any next-hop is onlink on ifaces with only host addresses.
Thanks to Stefan Haller for the original patch.
RFC 6810 and RFC 8210 specify that the "Max Length" value MUST NOT be
less than the Prefix Length element (underflow). On the other side,
overflow of the Max Length element also is possible, it being an 8-bit
unsigned integer allows for values larger than 32 or 128. This also
implicitly ensures there is no overflow of "Length" value.
When a PDU is received where the Max Length field is corrputed, the RTR
client (BIRD) should immediately terminate the session, flush all data
learned from that cache, and log an error for the operator.
Minor changes done by commiter.
Compare all IA_* flags that are set by sysdep iface code.
The old code ignores IA_SECONDARY flag when comparing whether iface
address updates from kernel changed anything. This is usually not an
issue as kernel removes all secondary addresses due to removal of the
primary one, but it breaks when sysctl 'promote_secondaries' is enabled
and kernel promotes secondary addresses to primary ones.
Thanks to 'Alexander' for the bugreport.
For convenience, Trie functions generally accept as input values not only
NET_IPx types of nets, but also NET_VPNx and NET_ROAx types. But returned
values are always NET_IPx types.
This feature is intended mostly for checking that BIRD's allocation
strategies don't consume much memory space. There are some cases where
withdrawing routes in a specific order lead to memory fragmentation and
this output should give the user at least a notion of how much memory is
actually used for data storage and how much memory is "just allocated"
or used for overhead.
Also raising the "system allocator overhead estimation" from 8 to 16
bytes; it is probably even more. I've found 16 as a local minimum in
best scenarios among reachable machines. I couldn't find any reasonable
method to estimate this value when BIRD starts up.
This commit also fixes the inaccurate computation of memory overhead for
slabs where the "system allocater overhead estimation" was improperly
added to the size of mmap-ed memory.
The prefix trie now supports longest-prefix-match query by function
trie_match_longest_ipX() and it can be extended to iteration over all
covering prefixes for a given prefix (from longest to shortest) using
TRIE_WALK_TO_ROOT_IPx() macro.
Trie walking allows enumeration of prefixes in a trie in the usual
lexicographic order. Optionally, trie enumeration can be restricted
to a chosen subnet (and its descendants).
BIRD implements shutdown by reconfiguring to fake empty configuration.
Such fake config structure is created from the last running config and
shares some data, including symbol table. This allows access to (removed)
routing tables and causes crash when 'show route' command is used during
shutdown.
Clean up symbol table, table list and links to default tables, so removed
routing tables cannot be accessed during shutdown.
Add trie tests intended as benchmarks that use external datasets
instead of generated prefixes. As datasets are not included, they
are commented out by default.
Use 16-way (4bit) branching in prefix trie instead of basic binary
branching. The change makes IPv4 prefix sets almost 3x faster, but
with more memory consumption and much more complicated algorithm.
Together with a previous filter change, it makes IPv4 prefix sets
about ~4.3x faster and slightly smaller (on my test data).
Pipes copy the original rte with old values, so they require rte to be
exported with stored tmpattrs. Other protocols access stored attributes
using eattr list, so they require rte to be exported with expanded
tmpattrs. This is temporary hack, we plan to remove whoe tmpattr mechanism.
Thanks to Paul Donohue for the bugreport.
In most cases of export there is no need to store back temporary
attributes to rte, as receivers (protocols) access eattr list anyway.
But pipe copies the original rte with old values, so we should store
tmpattrs also during export.
Thanks to Paul Donohue for the bugreport.
Some cleanups and bugfixes to the previous patch, including:
- Fix rate limiting in index mismatch check
- Fix missing BABEL_AUTH_INDEX_LEN in auth_tx_overhead computation
- Fix missing auth_tx_overhead recalculation during reconfiguration
- Fix pseudoheader construction in babel_auth_sign() (sport vs fport)
- Fix typecasts for ptrdiffs in log messages
- Make auth log messages similar to corresponding RIP/OSPF ones
- Change auth log messages for events that happen during regular
operation to debug messages
- Switch meaning of babel_auth_check*() functions for consistency
with corresponding RIP/OSPF ones
- Remove requirement for min/max key length, only those required by
given MAC code are enforced
This implements support for MAC authentication in the Babel protocol, as
specified by RFC 8967. The implementation seeks to follow the RFC as close
as possible, with the only deliberate deviation being the addition of
support for all the HMAC algorithms already supported by Bird, as well as
the Blake2b variant of the Blake algorithm.
For description of applicability, assumptions and security properties,
see RFC 8967 sections 1.1 and 1.2.
In preparation for adding authentication checks, refactor the TLV
walking code so it can be reused for a separate pass of the packet
for authentication checks.
Add support for specifying a password in hexadecimal format, The result
is the same whether a password is specified as a quoted string or a
hex-encoded byte string, this just makes it more convenient to input
high-entropy byte strings as MAC keys.
Import the blake2-kat.h header with test vector output from the blake
reference implementation, and add tests to mac_test.c to compare the
output of the Bird MAC algorithm implementations with that reference
output.
Since the reference implementation only has test vectors for the full
output size, there are no tests for the smaller-sized output variants.
The Babel MAC authentication RFC recommends implementing Blake2s as one of
the supported algorithms. In order to achieve do this, add the blake2b and
blake2s hash functions for MAC authentication. The hashing function
implementations are the reference implementations from blake2.net.
The Blake2 algorithms allow specifying an arbitrary output size, and the
Babel MAC spec says to implement Blake2s with 128-bit output. To satisfy
this, we add two different variants of each of the algorithms, one using
the default size (256 bits for Blake2s, 512 bits for Blake2b), and one
using half the default output size.
Update to BIRD coding style done by committer.
Add a wrapper function in sysdep to get random bytes, and required checks
in configure.ac to select how to do it. The configure script tries, in
order, getrandom(), getentropy() and reading from /dev/urandom.
Routes from downed protocols stay in rtable (until next rtable prune
cycle ends) and may be even exported to another protocol. In BGP case,
source BGP protocol is examined, although dynamic parts (including
neighbor entries) are already freed. That may lead to crash under some
race conditions. Ensure that freed neighbor entry is not accessed to
avoid this issue.
When an interface disappears, all the neighbors are freed as well. Seqno
requests were anyway not decoupled from them, leading to strange
segfaults. This fix adds a proper seqno request list inside neighbors to
make sure that no pointer to neighbor is kept after free.