Understanding Hardware Automation in Networking
type Linecard interface {
    Online() error
    Offline() error
    Status() error
}
In Go, the error return type indicates that a method can return an error value on failure. The implementation for a Juniper line card differs significantly from that for a Cisco line card, but callers of this interface are unaware of the differences. The primary code interacts only with the library and is limited to one of three actions: Online(), Offline(), or Status().
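As a rough illustration of how callers stay unaware of vendor differences, the sketch below defines two stand-in implementations and a helper that uses only the interface. The types juniperLinecard and ciscoLinecard, their fields, and the Restart helper are hypothetical names invented for this example. The sketch also assumes that Status() returns nil when the card is healthy, which the text does not specify.

```go
package main

import "fmt"

// Linecard mirrors the interface shown in the text.
type Linecard interface {
	Online() error
	Offline() error
	Status() error
}

// juniperLinecard is a hypothetical stand-in; a real implementation would
// talk to the device instead of flipping a flag.
type juniperLinecard struct {
	slot int
	up   bool
}

func (l *juniperLinecard) Online() error  { l.up = true; return nil }
func (l *juniperLinecard) Offline() error { l.up = false; return nil }

// Status returns nil when the card is up (an assumption for this sketch).
func (l *juniperLinecard) Status() error {
	if !l.up {
		return fmt.Errorf("juniper line card in slot %d is offline", l.slot)
	}
	return nil
}

// ciscoLinecard is a second hypothetical stand-in with different fields.
type ciscoLinecard struct {
	module string
	up     bool
}

func (l *ciscoLinecard) Online() error  { l.up = true; return nil }
func (l *ciscoLinecard) Offline() error { l.up = false; return nil }

func (l *ciscoLinecard) Status() error {
	if !l.up {
		return fmt.Errorf("cisco line card in module %s is offline", l.module)
	}
	return nil
}

// Restart takes a card offline and brings it back using only the interface,
// so it works for any vendor implementation.
func Restart(lc Linecard) error {
	if err := lc.Offline(); err != nil {
		return fmt.Errorf("offline: %w", err)
	}
	if err := lc.Online(); err != nil {
		return fmt.Errorf("online: %w", err)
	}
	return lc.Status()
}

func main() {
	cards := []Linecard{
		&juniperLinecard{slot: 3},
		&ciscoLinecard{module: "1/0"},
	}
	for _, lc := range cards {
		if err := Restart(lc); err != nil {
			fmt.Println("restart failed:", err)
			continue
		}
		fmt.Println("restart ok")
	}
}
```

Restart never inspects the concrete type, so supporting another vendor only means adding another type that satisfies Linecard.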
This flexibility let us extend the interface model to other hardware elements, such as fans, for which some vendors do not support toggling power on or off. In those cases, we relied mainly on the status check.
type Fan interface {
    Online() error
    Offline() error
    Status() error
}
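One way a vendor library might handle a fan that cannot be powered on or off is to return a sentinel error from the unsupported methods and implement only the status check. The sketch below assumes this approach. The names ErrNotSupported, statusOnlyFan, and PowerOff are hypothetical and do not come from the original system.

```go
package main

import (
	"errors"
	"fmt"
)

// Fan mirrors the interface shown in the text.
type Fan interface {
	Online() error
	Offline() error
	Status() error
}

// ErrNotSupported is a hypothetical sentinel for operations a vendor
// does not offer.
var ErrNotSupported = errors.New("operation not supported by this vendor")

// statusOnlyFan is a hypothetical fan whose vendor exposes no power control.
type statusOnlyFan struct {
	healthy bool
}

func (f statusOnlyFan) Online() error  { return ErrNotSupported }
func (f statusOnlyFan) Offline() error { return ErrNotSupported }

// Status returns nil when the fan is healthy (an assumption for this sketch).
func (f statusOnlyFan) Status() error {
	if !f.healthy {
		return errors.New("fan reports a fault")
	}
	return nil
}

// PowerOff tries to take a fan offline. If the vendor does not support the
// operation, it reports skipped=true rather than treating that as a failure.
func PowerOff(f Fan) (skipped bool, err error) {
	err = f.Offline()
	if errors.Is(err, ErrNotSupported) {
		return true, nil
	}
	return false, err
}

func main() {
	fan := statusOnlyFan{healthy: true}
	skipped, err := PowerOff(fan)
	fmt.Println("skipped:", skipped, "err:", err)
	fmt.Println("status:", fan.Status())
}
```

Callers can then distinguish "this vendor cannot do that" from a genuine failure with errors.Is, while still using the same three-method interface.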
Building on this concept, we formalized a general component interface that applies to all device hardware.
type Component interface {
    Online() error
    Offline() error
    Status() error
}
With this framework, we could integrate new devices from different manufacturers. Any new type of component can be added by implementing this common interface in its library and registering it by vendor.
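The text does not describe how registration works, so the following is only a minimal sketch of one possible registry keyed by vendor and component kind. Factory, Register, New, and stubComponent are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Component mirrors the interface shown in the text.
type Component interface {
	Online() error
	Offline() error
	Status() error
}

// Factory builds a Component for a given slot. The signature is an
// assumption made for this sketch.
type Factory func(slot string) Component

// key identifies an implementation by vendor and component kind.
type key struct {
	vendor string
	kind   string
}

var (
	mu       sync.RWMutex
	registry = map[key]Factory{}
)

// Register makes a vendor's implementation of a component kind available.
func Register(vendor, kind string, f Factory) {
	mu.Lock()
	defer mu.Unlock()
	registry[key{vendor, kind}] = f
}

// New looks up the registered factory and builds the component.
func New(vendor, kind, slot string) (Component, error) {
	mu.RLock()
	f, ok := registry[key{vendor, kind}]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("no %s implementation registered for vendor %q", kind, vendor)
	}
	return f(slot), nil
}

// stubComponent is a placeholder implementation used only in this example.
type stubComponent struct {
	name string
}

func (stubComponent) Online() error  { return nil }
func (stubComponent) Offline() error { return nil }
func (stubComponent) Status() error  { return nil }

func main() {
	Register("juniper", "linecard", func(slot string) Component {
		return stubComponent{name: "juniper linecard " + slot}
	})

	c, err := New("juniper", "linecard", "3")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Printf("built %+v, status error: %v\n", c, c.Status())

	// A kind that was never registered produces an error.
	_, err = New("cisco", "fan", "1")
	fmt.Println("lookup error:", err)
}
```

With a registry like this, the primary code asks for a component by vendor and kind and receives something that satisfies Component, without importing any vendor-specific type.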
Determining Automation Scope
Human interaction was required at multiple stages of the repair process, so to build an effective automated system we first mapped the human-driven repair workflows as flowcharts and then identified the steps that automation could replace. One practical case was replacing a vendor’s control plane board, where some steps were clearly defined and others needed clarification:
- Identify Control Plane: Locate the faulty control plane unit.
- Assess State: Determine whether it is the master or the backup.
- Transfer Image: Copy the appropriate software image onto the master control plane.
- Deactivate Control Plane: Take the backup offline.
- Suspend Mastership: Declare the replaced control plane as the new master.
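The steps above form an ordered procedure in which a failure should stop everything that follows. A minimal sketch of that idea is shown here. The Step type, the RunSteps function, and the simulated failure are assumptions for illustration, and the step bodies are stubs rather than real device operations.

```go
package main

import (
	"errors"
	"fmt"
)

// Step pairs a human-readable name with the action to perform.
// Both the type and its fields are hypothetical.
type Step struct {
	Name string
	Run  func() error
}

// errCopyTimeout is an illustrative failure used by the example in main.
var errCopyTimeout = errors.New("image copy timed out")

// RunSteps executes steps in order and stops at the first failure.
// It returns the names of the steps that completed.
func RunSteps(steps []Step) ([]string, error) {
	var done []string
	for _, s := range steps {
		if err := s.Run(); err != nil {
			return done, fmt.Errorf("step %q failed: %w", s.Name, err)
		}
		done = append(done, s.Name)
	}
	return done, nil
}

func main() {
	ok := func() error { return nil } // stub for a step that succeeds

	steps := []Step{
		{"Identify Control Plane", ok},
		{"Assess State", ok},
		{"Transfer Image", func() error { return errCopyTimeout }}, // simulated failure
		{"Deactivate Control Plane", ok},
		{"Suspend Mastership", ok},
	}

	done, err := RunSteps(steps)
	fmt.Println("completed:", done)
	fmt.Println("error:", err)
}
```

Because the simulated image transfer fails, the later steps never run, which is the behavior one would want before taking a control plane offline.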
During actual operations, a dedicated Google network engineer executed each phase outlined in Figure 1, while onsite personnel at our data centers removed and replaced the defective components.
This analysis led us to design an automated workflow that gives hardware engineers in the data centers an intuitive user interface (UI) tool. The tool lets them run the necessary operations under defined conditions, supported by automated checks and concluding with a comprehensive audit of the device after the operation. The change significantly reduced the tasks performed by humans: their primary responsibility is now physically replacing hardware components, as depicted in Figure 2.
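The shape of such a workflow (automated checks, one manual hardware swap, then an automated audit) can be sketched as follows. This is not the tool described here, only an illustration under assumptions: the Repair and Check types, the Execute method, and the errCheckFailed sentinel are hypothetical, and the manual step is modeled as a callback that a UI would invoke once the technician confirms the swap.

```go
package main

import (
	"errors"
	"fmt"
)

// Check is one automated verification. The type is hypothetical.
type Check func() error

// errCheckFailed is an illustrative failure used in the example.
var errCheckFailed = errors.New("check failed")

// Repair describes one guided repair operation.
type Repair struct {
	PreChecks []Check      // automated checks run before hardware is touched
	Swap      func() error // the manual step: technician replaces the part
	Audit     []Check      // automated device audit after the operation
}

// runAll runs checks in order and stops at the first failure.
func runAll(phase string, checks []Check) error {
	for i, c := range checks {
		if err := c(); err != nil {
			return fmt.Errorf("%s check %d: %w", phase, i+1, err)
		}
	}
	return nil
}

// Execute gates the manual swap behind the pre-checks and finishes with
// the audit. The swap is never requested if a pre-check fails.
func (r Repair) Execute() error {
	if err := runAll("pre", r.PreChecks); err != nil {
		return err
	}
	if err := r.Swap(); err != nil {
		return fmt.Errorf("swap: %w", err)
	}
	return runAll("audit", r.Audit)
}

func main() {
	swapped := false
	r := Repair{
		PreChecks: []Check{func() error { return nil }},
		Swap:      func() error { swapped = true; return nil },
		Audit: []Check{func() error {
			if !swapped {
				return errCheckFailed
			}
			return nil
		}},
	}
	fmt.Println("repair error:", r.Execute())
}
```

Keeping the manual step as a single callback reflects the goal stated above: the human's role shrinks to the physical replacement, and everything around it is checked automatically.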
The Shift From Manual Efforts To Full Automation
Figure 3 illustrates the process after automation and makes clear how much manual labor the earlier process demanded. Previously, the procedure involved extensive manual intervention: on receiving an alert, engineers halted traffic to the device, manually disabled the malfunctioning component, and then contacted vendors such as Juniper or Cisco about replacement parts.
Subsequently, submission logs were recorded indicating operation dates across several stages, such as:
If we highlight key actions undertaken on operation day: