diff --git a/doc/bird.sgml b/doc/bird.sgml
index e9c61526..3ea90920 100644
--- a/doc/bird.sgml
+++ b/doc/bird.sgml
@@ -157,6 +157,9 @@ options. The most important ones are:
-f
run bird in foreground.
+
+ -R
+ apply graceful restart recovery after start.
BIRD writes messages about its work to log files or syslog (according to config).
@@ -187,6 +190,7 @@ configuration, but it is generally easy -- BIRD needs just the
standard library, privileges to read the config file and create the
control socket and the CAP_NET_* capabilities.
+
About routing tables
BIRD has one or more routing tables which may or may not be
@@ -242,6 +246,20 @@ using comparison and ordering). Minor advantage is that routes are
shown sorted in Graceful restart
+
+When BIRD is started after restart or crash, it repopulates routing tables in
+an uncoordinated manner, like after clean start. This may be impractical in some
+cases, because if the forwarding plane (i.e. kernel routing tables) remains
+intact, then its synchronization with BIRD would temporarily disrupt packet
+forwarding until protocols converge. Graceful restart is a mechanism that could
+help with this issue. Generally, it works by starting protocols and letting them
+repopulate routing tables while deferring route propagation until protocols
+acknowledge their convergence. Note that graceful restart behavior has to be
+configured for all relevant protocols and requires protocol-specific support
+(currently implemented for Kernel and BGP protocols); it is activated for
+particular boot by option Configuration
@@ -371,6 +389,12 @@ protocol rip {
would accept IPv6 routes only). Such behavior was default in
older versions of BIRD.
+ graceful restart wait
+ During graceful restart recovery, BIRD waits for convergence of routing
+ protocols. This option allows specifying a timeout for the recovery to
+ prevent waiting indefinitely if some protocols cannot converge. Default:
+ 240 seconds.
+
timeformat route|protocol|base|log "
This option allows to specify a format of date/time used by
BIRD. The first argument specifies for which purpose such
@@ -1493,6 +1517,8 @@ extended communities
(RFC 4360),
route reflectors
(RFC 4456),
+graceful restart
+(RFC 4724),
multiprotocol extensions
(RFC 4760),
4B AS numbers
@@ -1502,9 +1528,7 @@ and 4B AS numbers in extended communities
For IPv6, it uses the standard multiprotocol extensions defined in
-RFC 2283
-including changes described in the
-latest draft
+RFC 4760
and applied to IPv6 according to
RFC 2545.
@@ -1716,6 +1740,26 @@ for each neighbor using the following configuration parameters:
capability and accepts such requests. Even when disabled, BIRD
can send route refresh requests. Default: on.
+ graceful restart
+ When a BGP speaker restarts or crashes, neighbors will discard all
+ received paths from the speaker, which disrupts packet forwarding even
+ when the forwarding plane of the speaker remains intact. RFC 4724
+ specifies an optional graceful restart mechanism to alleviate this
+ issue. This option controls the mechanism. It has three states:
+ Disabled, when no support is provided. Aware, when the graceful restart
+ support is announced and the support for restarting neighbors is
+ provided, but no local graceful restart is allowed (i.e. receiving-only
+ role). Enabled, when the full graceful restart support is provided
+ (i.e. both restarting and receiving role). Note that proper support for
+ local graceful restart also requires configuration of other protocols.
+ Default: aware.
+
+ graceful restart time
+ The restart time is announced in the BGP graceful restart capability
+ and specifies how long the neighbor would wait for the BGP session to
+ re-establish after a restart before deleting stale routes. Default:
+ 120 seconds.
+
interpret communities RFC 1997 demands
that BGP speaker should process well-known communities like
no-export (65535, 65281) or no-advertise (65535, 65282). For
@@ -2063,25 +2107,36 @@ overcome using another routing table and the pipe protocol.
Configuration
- persist Tell BIRD to leave all its routes in the
- routing tables when it exits (instead of cleaning them up).
- scan time Time in seconds between two consecutive scans of the
- kernel routing table.
- learn Enable learning of routes added to the kernel
- routing tables by other routing daemons or by the system administrator.
- This is possible only on systems which support identification of route
- authorship.
+ persist
+ Tell BIRD to leave all its routes in the routing tables when it exits
+ (instead of cleaning them up).
- device routes Enable export of device
- routes to the kernel routing table. By default, such routes
- are rejected (with the exception of explicitly configured
- device routes from the static protocol) regardless of the
- export filter to protect device routes in kernel routing table
- (managed by OS itself) from accidental overwriting or erasing.
+ scan time
+ Time in seconds between two consecutive scans of the kernel routing
+ table.
- kernel table Select which kernel table should
- this particular instance of the Kernel protocol work with. Available
- only on systems supporting multiple routing tables.
+ learn
+ Enable learning of routes added to the kernel routing tables by other
+ routing daemons or by the system administrator. This is possible only on
+ systems which support identification of route authorship.
+
+ device routes
+ Enable export of device routes to the kernel routing table. By default,
+ such routes are rejected (with the exception of explicitly configured
+ device routes from the static protocol) regardless of the export filter
+ to protect device routes in kernel routing table (managed by OS itself)
+ from accidental overwriting or erasing.
+
+ kernel table
+ Select which kernel table this particular instance of the Kernel protocol
+ should work with. Available only on systems supporting multiple
+ routing tables.
+
+ graceful restart
+ Participate in graceful restart recovery. If this option is enabled and
+ a graceful restart recovery is active, the Kernel protocol will defer
+ synchronization of routing tables until the end of the recovery. Note
+ that import of kernel routes to BIRD is not affected.
Attributes
diff --git a/nest/proto.c b/nest/proto.c
index 2bc3e319..e990b48f 100644
--- a/nest/proto.c
+++ b/nest/proto.c
@@ -51,6 +51,8 @@ static char *c_states[] = { "HUNGRY", "???", "HAPPY", "FLUSHING" };
static void proto_flush_loop(void *);
static void proto_shutdown_loop(struct timer *);
static void proto_rethink_goal(struct proto *p);
+static void proto_want_export_up(struct proto *p);
+static void proto_fell_down(struct proto *p);
static char *proto_state_name(struct proto *p);
static void
@@ -151,21 +153,20 @@ extern pool *rt_table_pool;
* @t: routing table to connect to
* @stats: per-table protocol statistics
*
- * This function creates a connection between the protocol instance @p
- * and the routing table @t, making the protocol hear all changes in
- * the table.
+ * This function creates a connection between the protocol instance @p and the
+ * routing table @t, making the protocol hear all changes in the table.
*
- * The announce hook is linked in the protocol ahook list and, if the
- * protocol accepts routes, also in the table ahook list. Announce
- * hooks are allocated from the routing table resource pool, they are
- * unlinked from the table ahook list after the protocol went down,
- * (in proto_schedule_flush()) and they are automatically freed after the
- * protocol is flushed (in proto_fell_down()).
+ * The announce hook is linked in the protocol ahook list. Announce hooks are
+ * allocated from the routing table resource pool. When the protocol accepts
+ * routes, they are also linked to the table ahook list and unlinked from it
+ * depending on export_state (in proto_want_export_up() and
+ * proto_want_export_down()), and they are automatically freed after the protocol
+ * is flushed (in proto_fell_down()).
*
- * Unless you want to listen to multiple routing tables (as the Pipe
- * protocol does), you needn't to worry about this function since the
- * connection to the protocol's primary routing table is initialized
- * automatically by the core code.
+ * Unless you want to listen to multiple routing tables (as the Pipe protocol
+ * does), you needn't worry about this function since the connection to the
+ * protocol's primary routing table is initialized automatically by the core
+ * code.
*/
struct announce_hook *
proto_add_announce_hook(struct proto *p, struct rtable *t, struct proto_stats *stats)
@@ -183,7 +184,7 @@ proto_add_announce_hook(struct proto *p, struct rtable *t, struct proto_stats *s
h->next = p->ahooks;
p->ahooks = h;
- if (p->rt_notify && (p->export_state == ES_READY))
+ if (p->rt_notify && (p->export_state != ES_DOWN))
add_tail(&t->hooks, &h->n);
return h;
}
@@ -659,16 +660,59 @@ proto_rethink_goal(struct proto *p)
}
+/**
+ * DOC: Graceful restart recovery
+ *
+ * Graceful restart of a router is a process when the routing plane (e.g. BIRD)
+ * restarts but both the forwarding plane (e.g. kernel routing table) and routing
+ * neighbors keep proper routes, and therefore uninterrupted packet forwarding
+ * is maintained.
+ *
+ * BIRD implements graceful restart recovery by deferring export of routes to
+ * protocols until routing tables are refilled with the expected content. After
+ * start, protocols generate routes as usual, but routes are not propagated to
+ * them, until protocols report that they generated all routes. After that,
+ * graceful restart recovery is finished and the export (and the initial feed)
+ * to protocols is enabled.
+ *
+ * When the need for graceful restart recovery is detected during initialization,
+ * enabled protocols are marked with @gr_recovery flag before start. Such
+ * protocols then decide how to proceed with graceful restart; participation is
+ * voluntary. Protocols could lock the recovery by proto_graceful_restart_lock()
+ * (stored in @gr_lock flag), which means that they want to postpone the end of
+ * the recovery until they converge and then unlock it. They also could set
+ * @gr_wait before advancing to %PS_UP, which means that the core should defer
+ * route export to that protocol until the end of the recovery. This should be
+ * done by protocols that expect their neighbors to keep the proper routes
+ * (kernel table, BGP sessions with BGP graceful restart capability).
+ *
+ * The graceful restart recovery is finished when either all graceful restart
+ * locks are unlocked or when graceful restart wait timer fires.
+ *
+ */
-static void graceful_restart_done(struct timer *t UNUSED);
-static void proto_want_export_up(struct proto *p);
+static void graceful_restart_done(struct timer *t);
+/**
+ * graceful_restart_recovery - request initial graceful restart recovery
+ *
+ * Called by the platform initialization code if the need for recovery
+ * after graceful restart is detected during boot. Has to be called
+ * before protos_commit().
+ */
void
graceful_restart_recovery(void)
{
graceful_restart_state = GRS_INIT;
}
+/**
+ * graceful_restart_init - initialize graceful restart
+ *
+ * When graceful restart recovery was requested, the function starts an active
+ * phase of the recovery and initializes graceful restart wait timer. The
+ * function has to be called after protos_commit().
+ */
void
graceful_restart_init(void)
{
@@ -689,6 +733,15 @@ graceful_restart_init(void)
tm_start(gr_wait_timer, config->gr_wait);
}
+/**
+ * graceful_restart_done - finalize graceful restart
+ *
+ * When there are no locks on graceful restart, the function finalizes the
+ * graceful restart recovery. Protocols postponing route export until the end of
+ * the recovery are awakened and the export to them is enabled. All other
+ * related state is cleared. The function is also called when the graceful
+ * restart wait timer fires (even if there are still some locks).
+ */
static void
graceful_restart_done(struct timer *t UNUSED)
{
@@ -727,7 +780,19 @@ graceful_restart_show_status(void)
cli_msg(-24, " Wait timer is %d/%d", tm_remains(gr_wait_timer), config->gr_wait);
}
-/* Just from start hook */
+/**
+ * proto_graceful_restart_lock - lock graceful restart by protocol
+ * @p: protocol instance
+ *
+ * This function allows a protocol to postpone the end of graceful restart
+ * recovery until it converges. The lock is removed when the protocol calls
+ * proto_graceful_restart_unlock() or when the protocol is stopped.
+ *
+ * The function has to be called during the initial phase of graceful restart
+ * recovery and only for protocols that are part of graceful restart (i.e. their
+ * @gr_recovery is set), which means it should be called from protocol start
+ * hooks.
+ */
void
proto_graceful_restart_lock(struct proto *p)
{
@@ -741,6 +806,13 @@ proto_graceful_restart_lock(struct proto *p)
graceful_restart_locks++;
}
+/**
+ * proto_graceful_restart_unlock - unlock graceful restart by protocol
+ * @p: protocol instance
+ *
+ * This function unlocks a lock from proto_graceful_restart_lock(). It is also
+ * automatically called when the lock-holding protocol goes down.
+ */
void
proto_graceful_restart_unlock(struct proto *p)
{
@@ -867,29 +939,6 @@ protos_build(void)
proto_flush_event->hook = proto_flush_loop;
proto_shutdown_timer = tm_new(proto_pool);
proto_shutdown_timer->hook = proto_shutdown_loop;
- proto_shutdown_timer = tm_new(proto_pool);
- proto_shutdown_timer->hook = proto_shutdown_loop;
-}
-
-static void
-proto_fell_down(struct proto *p)
-{
- DBG("Protocol %s down\n", p->name);
-
- u32 all_routes = p->stats.imp_routes + p->stats.filt_routes;
- if (all_routes != 0)
- log(L_ERR "Protocol %s is down but still has %d routes", p->name, all_routes);
-
- bzero(&p->stats, sizeof(struct proto_stats));
- proto_free_ahooks(p);
-
- if (! p->proto->multitable)
- rt_unlock_table(p->table);
-
- if (p->proto->cleanup)
- p->proto->cleanup(p);
-
- proto_rethink_goal(p);
}
static void
@@ -1066,6 +1115,10 @@ proto_request_feeding(struct proto *p)
{
ASSERT(p->proto_state == PS_UP);
+ /* Do nothing if we are still waiting for feeding */
+ if (p->export_state == ES_DOWN)
+ return;
+
/* If we are already feeding, we want to restart it */
if (p->export_state == ES_FEEDING)
{
@@ -1220,6 +1273,27 @@ proto_falling_down(struct proto *p)
proto_graceful_restart_unlock(p);
}
+static void
+proto_fell_down(struct proto *p)
+{
+ DBG("Protocol %s down\n", p->name);
+
+ u32 all_routes = p->stats.imp_routes + p->stats.filt_routes;
+ if (all_routes != 0)
+ log(L_ERR "Protocol %s is down but still has %d routes", p->name, all_routes);
+
+ bzero(&p->stats, sizeof(struct proto_stats));
+ proto_free_ahooks(p);
+
+ if (! p->proto->multitable)
+ rt_unlock_table(p->table);
+
+ if (p->proto->cleanup)
+ p->proto->cleanup(p);
+
+ proto_rethink_goal(p);
+}
+
/**
* proto_notify_state - notify core about protocol state change
diff --git a/nest/rt-table.c b/nest/rt-table.c
index bc911729..4295f836 100644
--- a/nest/rt-table.c
+++ b/nest/rt-table.c
@@ -1110,6 +1110,21 @@ rt_examine(rtable *t, ip_addr prefix, int pxlen, struct proto *p, struct filter
return v > 0;
}
+
+/**
+ * rt_refresh_begin - start a refresh cycle
+ * @t: related routing table
+ * @ah: related announce hook
+ *
+ * This function starts a refresh cycle for given routing table and announce
+ * hook. The refresh cycle is a sequence where the protocol sends all its valid
+ * routes to the routing table (by rte_update()). After that, all protocol
+ * routes (more precisely routes with @ah as @sender) not sent during the
+ * refresh cycle but still in the table from the past are pruned. This is
+ * implemented by marking all related routes as stale by REF_STALE flag in
+ * rt_refresh_begin(), then marking all related stale routes with REF_DISCARD
+ * flag in rt_refresh_end() and then removing such routes in the prune loop.
+ */
void
rt_refresh_begin(rtable *t, struct announce_hook *ah)
{
@@ -1126,6 +1141,14 @@ rt_refresh_begin(rtable *t, struct announce_hook *ah)
FIB_WALK_END;
}
+/**
+ * rt_refresh_end - end a refresh cycle
+ * @t: related routing table
+ * @ah: related announce hook
+ *
+ * This function ends a refresh cycle for given routing table and announce
+ * hook. See rt_refresh_begin() for description of refresh cycles.
+ */
void
rt_refresh_end(rtable *t, struct announce_hook *ah)
{
@@ -1405,6 +1428,19 @@ again:
return 1;
}
+/**
+ * rt_prune_table - prune a routing table
+ *
+ * This function scans the routing table @tab and removes routes belonging to
+ * flushing protocols, discarded routes and also stale network entries, in a
+ * fashion similar to rt_prune_loop(). Returns 1 when all such routes are
+ * pruned. Contrary to rt_prune_loop(), this function is not a part of the
+ * protocol flushing loop, but it is called from rt_event() for just one routing
+ * table.
+ *
+ * Note that rt_prune_table() and rt_prune_loop() share (for each table) the
+ * prune state (@prune_state) and also the pruning iterator (@prune_fit).
+ */
static inline int
rt_prune_table(rtable *tab)
{
@@ -1415,16 +1451,15 @@ rt_prune_table(rtable *tab)
/**
* rt_prune_loop - prune routing tables
*
- * The prune loop scans routing tables and removes routes belonging to
- * flushing protocols and also stale network entries. Returns 1 when
- * all such routes are pruned. It is a part of the protocol flushing
- * loop.
+ * The prune loop scans routing tables and removes routes belonging to flushing
+ * protocols, discarded routes and also stale network entries. Returns 1 when
+ * all such routes are pruned. It is a part of the protocol flushing loop.
*
- * The prune loop runs in two steps. In the first step it prunes just
- * the routes with flushing senders (in explicitly marked tables) so
- * the route removal is propagated as usual. In the second step, all
- * remaining relevant routes are removed. Ideally, there shouldn't be
- * any, but it happens when pipe filters are changed.
+ * The prune loop runs in two steps. In the first step it prunes just the routes
+ * with flushing senders (in explicitly marked tables) so the route removal is
+ * propagated as usual. In the second step, all remaining relevant routes are
+ * removed. Ideally, there shouldn't be any, but it happens when pipe filters
+ * are changed.
*/
int
rt_prune_loop(void)
diff --git a/proto/bgp/bgp.c b/proto/bgp/bgp.c
index ae9f6877..326883dd 100644
--- a/proto/bgp/bgp.c
+++ b/proto/bgp/bgp.c
@@ -51,6 +51,16 @@
* and bgp_encode_attrs() which does the converse. Both functions are built around a
* @bgp_attr_table array describing all important characteristics of all known attributes.
* Unknown transitive attributes are attached to the route as %EAF_TYPE_OPAQUE byte streams.
+ *
+ * BGP protocol implements graceful restart in both restarting (local restart)
+ * and receiving (neighbor restart) roles. The first is handled mostly by the
+ * graceful restart code in the nest, BGP protocol just handles capabilities,
+ * sets @gr_wait and locks graceful restart until end-of-RIB mark is received.
+ * The second is implemented by internal restart of the BGP state to %BS_IDLE
+ * and protocol state to %PS_START, but keeping the protocol up from the core
+ * point of view and therefore maintaining received routes. Routing table
+ * refresh cycle (rt_refresh_begin(), rt_refresh_end()) is used for removing
+ * stale routes after reestablishment of BGP session during graceful restart.
*/
#undef LOCAL_DEBUG
@@ -431,6 +441,17 @@ bgp_conn_enter_idle_state(struct bgp_conn *conn)
bgp_conn_leave_established_state(p);
}
+/**
+ * bgp_handle_graceful_restart - handle detected BGP graceful restart
+ * @p: BGP instance
+ *
+ * This function is called when a BGP graceful restart of the neighbor is
+ * detected (when the TCP connection fails or when a new TCP connection
+ * appears). The function activates processing of the restart - starts routing
+ * table refresh cycle and activates BGP restart timer. The protocol state goes
+ * back to %PS_START, but changing BGP state back to %BS_IDLE is left for the
+ * caller.
+ */
void
bgp_handle_graceful_restart(struct bgp_proto *p)
{
@@ -448,6 +469,16 @@ bgp_handle_graceful_restart(struct bgp_proto *p)
rt_refresh_begin(p->p.main_ahook->table, p->p.main_ahook);
}
+/**
+ * bgp_graceful_restart_done - finish active BGP graceful restart
+ * @p: BGP instance
+ *
+ * This function is called when the active BGP graceful restart of the neighbor
+ * should be finished - either successfully (the neighbor sends all paths and
+ * reports end-of-RIB on the new session) or unsuccessfully (the neighbor does
+ * not support BGP graceful restart on the new session). The function ends
+ * routing table refresh cycle and stops BGP restart timer.
+ */
void
bgp_graceful_restart_done(struct bgp_proto *p)
{
@@ -457,6 +488,15 @@ bgp_graceful_restart_done(struct bgp_proto *p)
rt_refresh_end(p->p.main_ahook->table, p->p.main_ahook);
}
+/**
+ * bgp_graceful_restart_timeout - timeout of graceful restart 'restart timer'
+ * @t: timer
+ *
+ * This function is a timeout hook for @gr_timer, implementing BGP restart time
+ * limit for reestablishment of the BGP session after the graceful restart. When
+ * fired, we just proceed with the usual protocol restart.
+ */
+
static void
bgp_graceful_restart_timeout(timer *t)
{
@@ -968,7 +1008,7 @@ bgp_start(struct proto *P)
p->remote_id = 0;
p->source_addr = p->cf->source_addr;
- if (P->gr_recovery)
+ if (p->p.gr_recovery && p->cf->gr_mode)
proto_graceful_restart_lock(P);
/*