diff --git a/doc/bird.sgml b/doc/bird.sgml index e9c61526..3ea90920 100644 --- a/doc/bird.sgml +++ b/doc/bird.sgml @@ -157,6 +157,9 @@ options. The most important ones are: -f run bird in foreground. + + -R + apply graceful restart recovery after start.

BIRD writes messages about its work to log files or syslog (according to config). @@ -187,6 +190,7 @@ configuration, but it is generally easy -- BIRD needs just the standard library, privileges to read the config file and create the control socket and the CAP_NET_* capabilities. + About routing tables

BIRD has one or more routing tables which may or may not be @@ -242,6 +246,20 @@ using comparison and ordering). Minor advantage is that routes are shown sorted in Graceful restart + +

When BIRD is started after restart or crash, it repopulates routing tables in +an uncoordinated manner, like after clean start. This may be impractical in some +cases, because if the forwarding plane (i.e. kernel routing tables) remains +intact, then its synchronization with BIRD would temporarily disrupt packet +forwarding until protocols converge. Graceful restart is a mechanism that could +help with this issue. Generally, it works by starting protocols and letting them +repopulate routing tables while deferring route propagation until protocols +acknowledge their convergence. Note that graceful restart behavior have to be +configured for all relevant protocols and requires protocol-specific support +(currently implemented for Kernel and BGP protocols), it is activated for +particular boot by option Configuration @@ -371,6 +389,12 @@ protocol rip { would accept IPv6 routes only). Such behavior was default in older versions of BIRD. + graceful restart wait + During graceful restart recovery, BIRD waits for convergence of routing + protocols. This option allows to specify a timeout for the recovery to + prevent waiting indefinitely if some protocols cannot converge. Default: + 240 seconds. + timeformat route|protocol|base|log " This option allows to specify a format of date/time used by BIRD. The first argument specifies for which purpose such @@ -1493,6 +1517,8 @@ extended communities (RFC 4360), route reflectors (RFC 4456), +graceful restart +(RFC 4724), multiprotocol extensions (RFC 4760), 4B AS numbers @@ -1502,9 +1528,7 @@ and 4B AS numbers in extended communities For IPv6, it uses the standard multiprotocol extensions defined in -RFC 2283 -including changes described in the -latest draft +RFC 4760 and applied to IPv6 according to RFC 2545. @@ -1716,6 +1740,26 @@ for each neighbor using the following configuration parameters: capability and accepts such requests. Even when disabled, BIRD can send route refresh requests. Default: on. + graceful restart + When a BGP speaker restarts or crashes, neighbors will discard all + received paths from the speaker, which disrupts packet forwarding even + when the forwarding plane of the speaker remains intact. RFC 4724 + specifies an optional graceful restart mechanism to alleviate this + issue. This option controls the mechanism. It has three states: + Disabled, when no support is provided. Aware, when the graceful restart + support is announced and the support for restarting neighbors is + provided, but no local graceful restart is allowed (i.e. receiving-only + role). Enabled, when the full graceful restart support is provided + (i.e. both restarting and receiving role). Note that proper support for + local graceful restart requires also configuration of other protocols. + Default: aware. + + graceful restart time + The restart time is announced in the BGP graceful restart capability + and specifies how long the neighbor would wait for the BGP session to + re-establish after a restart before deleting stale routes. Default: + 120 seconds. + interpret communities RFC 1997 demands that BGP speaker should process well-known communities like no-export (65535, 65281) or no-advertise (65535, 65282). For @@ -2063,25 +2107,36 @@ overcome using another routing table and the pipe protocol. Configuration

- persist Tell BIRD to leave all its routes in the - routing tables when it exits (instead of cleaning them up). - scan time Time in seconds between two consecutive scans of the - kernel routing table. - learn Enable learning of routes added to the kernel - routing tables by other routing daemons or by the system administrator. - This is possible only on systems which support identification of route - authorship. + persist + Tell BIRD to leave all its routes in the routing tables when it exits + (instead of cleaning them up). - device routes Enable export of device - routes to the kernel routing table. By default, such routes - are rejected (with the exception of explicitly configured - device routes from the static protocol) regardless of the - export filter to protect device routes in kernel routing table - (managed by OS itself) from accidental overwriting or erasing. + scan time + Time in seconds between two consecutive scans of the kernel routing + table. - kernel table Select which kernel table should - this particular instance of the Kernel protocol work with. Available - only on systems supporting multiple routing tables. + learn + Enable learning of routes added to the kernel routing tables by other + routing daemons or by the system administrator. This is possible only on + systems which support identification of route authorship. + + device routes + Enable export of device routes to the kernel routing table. By default, + such routes are rejected (with the exception of explicitly configured + device routes from the static protocol) regardless of the export filter + to protect device routes in kernel routing table (managed by OS itself) + from accidental overwriting or erasing. + + kernel table + Select which kernel table should this particular instance of the Kernel + protocol work with. Available only on systems supporting multiple + routing tables. + + graceful restart + Participate in graceful restart recovery. If this option is enabled and + a graceful restart recovery is active, the Kernel protocol will defer + synchronization of routing tables until the end of the recovery. Note + that import of kernel routes to BIRD is not affected. Attributes diff --git a/nest/proto.c b/nest/proto.c index 2bc3e319..e990b48f 100644 --- a/nest/proto.c +++ b/nest/proto.c @@ -51,6 +51,8 @@ static char *c_states[] = { "HUNGRY", "???", "HAPPY", "FLUSHING" }; static void proto_flush_loop(void *); static void proto_shutdown_loop(struct timer *); static void proto_rethink_goal(struct proto *p); +static void proto_want_export_up(struct proto *p); +static void proto_fell_down(struct proto *p); static char *proto_state_name(struct proto *p); static void @@ -151,21 +153,20 @@ extern pool *rt_table_pool; * @t: routing table to connect to * @stats: per-table protocol statistics * - * This function creates a connection between the protocol instance @p - * and the routing table @t, making the protocol hear all changes in - * the table. + * This function creates a connection between the protocol instance @p and the + * routing table @t, making the protocol hear all changes in the table. * - * The announce hook is linked in the protocol ahook list and, if the - * protocol accepts routes, also in the table ahook list. Announce - * hooks are allocated from the routing table resource pool, they are - * unlinked from the table ahook list after the protocol went down, - * (in proto_schedule_flush()) and they are automatically freed after the - * protocol is flushed (in proto_fell_down()). + * The announce hook is linked in the protocol ahook list. Announce hooks are + * allocated from the routing table resource pool and when protocol accepts + * routes also in the table ahook list. The are linked to the table ahook list + * and unlinked from it depending on export_state (in proto_want_export_up() and + * proto_want_export_down()) and they are automatically freed after the protocol + * is flushed (in proto_fell_down()). * - * Unless you want to listen to multiple routing tables (as the Pipe - * protocol does), you needn't to worry about this function since the - * connection to the protocol's primary routing table is initialized - * automatically by the core code. + * Unless you want to listen to multiple routing tables (as the Pipe protocol + * does), you needn't to worry about this function since the connection to the + * protocol's primary routing table is initialized automatically by the core + * code. */ struct announce_hook * proto_add_announce_hook(struct proto *p, struct rtable *t, struct proto_stats *stats) @@ -183,7 +184,7 @@ proto_add_announce_hook(struct proto *p, struct rtable *t, struct proto_stats *s h->next = p->ahooks; p->ahooks = h; - if (p->rt_notify && (p->export_state == ES_READY)) + if (p->rt_notify && (p->export_state != ES_DOWN)) add_tail(&t->hooks, &h->n); return h; } @@ -659,16 +660,59 @@ proto_rethink_goal(struct proto *p) } +/** + * DOC: Graceful restart recovery + * + * Graceful restart of a router is a process when the routing plane (e.g. BIRD) + * restarts but both the forwarding plane (e.g kernel routing table) and routing + * neighbors keep proper routes, and therefore uninterrupted packet forwarding + * is maintained. + * + * BIRD implements graceful restart recovery by deferring export of routes to + * protocols until routing tables are refilled with the expected content. After + * start, protocols generate routes as usual, but routes are not propagated to + * them, until protocols report that they generated all routes. After that, + * graceful restart recovery is finished and the export (and the initial feed) + * to protocols is enabled. + * + * When graceful restart recovery need is detected during initialization, then + * enabled protocols are marked with @gr_recovery flag before start. Such + * protocols then decide how to proceed with graceful restart, participation is + * voluntary. Protocols could lock the recovery by proto_graceful_restart_lock() + * (stored in @gr_lock flag), which means that they want to postpone the end of + * the recovery until they converge and then unlock it. They also could set + * @gr_wait before advancing to %PS_UP, which means that the core should defer + * route export to that protocol until the end of the recovery. This should be + * done by protocols that expect their neigbors to keep the proper routes + * (kernel table, BGP sessions with BGP graceful restart capability). + * + * The graceful restart recovery is finished when either all graceful restart + * locks are unlocked or when graceful restart wait timer fires. + * + */ -static void graceful_restart_done(struct timer *t UNUSED); -static void proto_want_export_up(struct proto *p); +static void graceful_restart_done(struct timer *t); +/** + * graceful_restart_recovery - request initial graceful restart recovery + * + * Called by the platform initialization code if the need for recovery + * after graceful restart is detected during boot. Have to be called + * before protos_commit(). + */ void graceful_restart_recovery(void) { graceful_restart_state = GRS_INIT; } +/** + * graceful_restart_init - initialize graceful restart + * + * When graceful restart recovery was requested, the function starts an active + * phase of the recovery and initializes graceful restart wait timer. The + * function have to be called after protos_commit(). + */ void graceful_restart_init(void) { @@ -689,6 +733,15 @@ graceful_restart_init(void) tm_start(gr_wait_timer, config->gr_wait); } +/** + * graceful_restart_done - finalize graceful restart + * + * When there are no locks on graceful restart, the functions finalizes the + * graceful restart recovery. Protocols postponing route export until the end of + * the recovery are awakened and the export to them is enabled. All other + * related state is cleared. The function is also called when the graceful + * restart wait timer fires (but there are still some locks). + */ static void graceful_restart_done(struct timer *t UNUSED) { @@ -727,7 +780,19 @@ graceful_restart_show_status(void) cli_msg(-24, " Wait timer is %d/%d", tm_remains(gr_wait_timer), config->gr_wait); } -/* Just from start hook */ +/** + * proto_graceful_restart_lock - lock graceful restart by protocol + * @p: protocol instance + * + * This function allows a protocol to postpone the end of graceful restart + * recovery until it converges. The lock is removed when the protocol calls + * proto_graceful_restart_unlock() or when the protocol is stopped. + * + * The function have to be called during the initial phase of graceful restart + * recovery and only for protocols that are part of graceful restart (i.e. their + * @gr_recovery is set), which means it should be called from protocol start + * hooks. + */ void proto_graceful_restart_lock(struct proto *p) { @@ -741,6 +806,13 @@ proto_graceful_restart_lock(struct proto *p) graceful_restart_locks++; } +/** + * proto_graceful_restart_unlock - unlock graceful restart by protocol + * @p: protocol instance + * + * This function unlocks a lock from proto_graceful_restart_lock(). It is also + * automatically called when the lock holding protocol went down. + */ void proto_graceful_restart_unlock(struct proto *p) { @@ -867,29 +939,6 @@ protos_build(void) proto_flush_event->hook = proto_flush_loop; proto_shutdown_timer = tm_new(proto_pool); proto_shutdown_timer->hook = proto_shutdown_loop; - proto_shutdown_timer = tm_new(proto_pool); - proto_shutdown_timer->hook = proto_shutdown_loop; -} - -static void -proto_fell_down(struct proto *p) -{ - DBG("Protocol %s down\n", p->name); - - u32 all_routes = p->stats.imp_routes + p->stats.filt_routes; - if (all_routes != 0) - log(L_ERR "Protocol %s is down but still has %d routes", p->name, all_routes); - - bzero(&p->stats, sizeof(struct proto_stats)); - proto_free_ahooks(p); - - if (! p->proto->multitable) - rt_unlock_table(p->table); - - if (p->proto->cleanup) - p->proto->cleanup(p); - - proto_rethink_goal(p); } static void @@ -1066,6 +1115,10 @@ proto_request_feeding(struct proto *p) { ASSERT(p->proto_state == PS_UP); + /* Do nothing if we are still waiting for feeding */ + if (p->export_state == ES_DOWN) + return; + /* If we are already feeding, we want to restart it */ if (p->export_state == ES_FEEDING) { @@ -1220,6 +1273,27 @@ proto_falling_down(struct proto *p) proto_graceful_restart_unlock(p); } +static void +proto_fell_down(struct proto *p) +{ + DBG("Protocol %s down\n", p->name); + + u32 all_routes = p->stats.imp_routes + p->stats.filt_routes; + if (all_routes != 0) + log(L_ERR "Protocol %s is down but still has %d routes", p->name, all_routes); + + bzero(&p->stats, sizeof(struct proto_stats)); + proto_free_ahooks(p); + + if (! p->proto->multitable) + rt_unlock_table(p->table); + + if (p->proto->cleanup) + p->proto->cleanup(p); + + proto_rethink_goal(p); +} + /** * proto_notify_state - notify core about protocol state change diff --git a/nest/rt-table.c b/nest/rt-table.c index bc911729..4295f836 100644 --- a/nest/rt-table.c +++ b/nest/rt-table.c @@ -1110,6 +1110,21 @@ rt_examine(rtable *t, ip_addr prefix, int pxlen, struct proto *p, struct filter return v > 0; } + +/** + * rt_refresh_begin - start a refresh cycle + * @t: related routing table + * @ah: related announce hook + * + * This function starts a refresh cycle for given routing table and announce + * hook. The refresh cycle is a sequence where the protocol sends all its valid + * routes to the routing table (by rte_update()). After that, all protocol + * routes (more precisely routes with @ah as @sender) not sent during the + * refresh cycle but still in the table from the past are pruned. This is + * implemented by marking all related routes as stale by REF_STALE flag in + * rt_refresh_begin(), then marking all related stale routes with REF_DISCARD + * flag in rt_refresh_end() and then removing such routes in the prune loop. + */ void rt_refresh_begin(rtable *t, struct announce_hook *ah) { @@ -1126,6 +1141,14 @@ rt_refresh_begin(rtable *t, struct announce_hook *ah) FIB_WALK_END; } +/** + * rt_refresh_end - end a refresh cycle + * @t: related routing table + * @ah: related announce hook + * + * This function starts a refresh cycle for given routing table and announce + * hook. See rt_refresh_begin() for description of refresh cycles. + */ void rt_refresh_end(rtable *t, struct announce_hook *ah) { @@ -1405,6 +1428,19 @@ again: return 1; } +/** + * rt_prune_table - prune a routing table + * + * This function scans the routing table @tab and removes routes belonging to + * flushing protocols, discarded routes and also stale network entries, in a + * similar fashion like rt_prune_loop(). Returns 1 when all such routes are + * pruned. Contrary to rt_prune_loop(), this function is not a part of the + * protocol flushing loop, but it is called from rt_event() for just one routing + * table. + * + * Note that rt_prune_table() and rt_prune_loop() share (for each table) the + * prune state (@prune_state) and also the pruning iterator (@prune_fit). + */ static inline int rt_prune_table(rtable *tab) { @@ -1415,16 +1451,15 @@ rt_prune_table(rtable *tab) /** * rt_prune_loop - prune routing tables * - * The prune loop scans routing tables and removes routes belonging to - * flushing protocols and also stale network entries. Returns 1 when - * all such routes are pruned. It is a part of the protocol flushing - * loop. + * The prune loop scans routing tables and removes routes belonging to flushing + * protocols, discarded routes and also stale network entries. Returns 1 when + * all such routes are pruned. It is a part of the protocol flushing loop. * - * The prune loop runs in two steps. In the first step it prunes just - * the routes with flushing senders (in explicitly marked tables) so - * the route removal is propagated as usual. In the second step, all - * remaining relevant routes are removed. Ideally, there shouldn't be - * any, but it happens when pipe filters are changed. + * The prune loop runs in two steps. In the first step it prunes just the routes + * with flushing senders (in explicitly marked tables) so the route removal is + * propagated as usual. In the second step, all remaining relevant routes are + * removed. Ideally, there shouldn't be any, but it happens when pipe filters + * are changed. */ int rt_prune_loop(void) diff --git a/proto/bgp/bgp.c b/proto/bgp/bgp.c index ae9f6877..326883dd 100644 --- a/proto/bgp/bgp.c +++ b/proto/bgp/bgp.c @@ -51,6 +51,16 @@ * and bgp_encode_attrs() which does the converse. Both functions are built around a * @bgp_attr_table array describing all important characteristics of all known attributes. * Unknown transitive attributes are attached to the route as %EAF_TYPE_OPAQUE byte streams. + * + * BGP protocol implements graceful restart in both restarting (local restart) + * and receiving (neighbor restart) roles. The first is handled mostly by the + * graceful restart code in the nest, BGP protocol just handles capabilities, + * sets @gr_wait and locks graceful restart until end-of-RIB mark is received. + * The second is implemented by internal restart of the BGP state to %BS_IDLE + * and protocol state to %PS_START, but keeping the protocol up from the core + * point of view and therefore maintaining received routes. Routing table + * refresh cycle (rt_refresh_begin(), rt_refresh_end()) is used for removing + * stale routes after reestablishment of BGP session during graceful restart. */ #undef LOCAL_DEBUG @@ -431,6 +441,17 @@ bgp_conn_enter_idle_state(struct bgp_conn *conn) bgp_conn_leave_established_state(p); } +/** + * bgp_handle_graceful_restart - handle detected BGP graceful restart + * @p: BGP instance + * + * This function is called when a BGP graceful restart of the neighbor is + * detected (when the TCP connection fails or when a new TCP connection + * appears). The function activates processing of the restart - starts routing + * table refresh cycle and activates BGP restart timer. The protocol state goes + * back to %PS_START, but changing BGP state back to %BS_IDLE is left for the + * caller. + */ void bgp_handle_graceful_restart(struct bgp_proto *p) { @@ -448,6 +469,16 @@ bgp_handle_graceful_restart(struct bgp_proto *p) rt_refresh_begin(p->p.main_ahook->table, p->p.main_ahook); } +/** + * bgp_graceful_restart_done - finish active BGP graceful restart + * @p: BGP instance + * + * This function is called when the active BGP graceful restart of the neighbor + * should be finished - either successfully (the neighbor sends all paths and + * reports end-of-RIB on the new session) or unsuccessfully (the neighbor does + * not support BGP graceful restart on the new session). The function ends + * routing table refresh cycle and stops BGP restart timer. + */ void bgp_graceful_restart_done(struct bgp_proto *p) { @@ -457,6 +488,15 @@ bgp_graceful_restart_done(struct bgp_proto *p) rt_refresh_end(p->p.main_ahook->table, p->p.main_ahook); } +/** + * bgp_graceful_restart_timeout - timeout of graceful restart 'restart timer' + * @t: timer + * + * This function is a timeout hook for @gr_timer, implementing BGP restart time + * limit for reestablisment of the BGP session after the graceful restart. When + * fired, we just proceed with the usual protocol restart. + */ + static void bgp_graceful_restart_timeout(timer *t) { @@ -968,7 +1008,7 @@ bgp_start(struct proto *P) p->remote_id = 0; p->source_addr = p->cf->source_addr; - if (P->gr_recovery) + if (p->p.gr_recovery && p->cf->gr_mode) proto_graceful_restart_lock(P); /*