Why we have decided to combine workflows with network data models

A couple of words about our decision to combine workflows with network data models. We think it’s a match made in heaven. Here is why.

Network operators worldwide have embraced YANG models for services and devices, and we have implemented support for them in FRINX UniConfig. What we see now is that our customers write their own workflow execution logic, for example in languages like Python, to interact with our UniConfig APIs, with Ansible and with other systems. Among them are IPAM, CRM, inventory, AWS, GCP and many more.

Easy to build and hard to operate? Not anymore

Infrastructure engineers and network admins who write their own workflow execution logic usually face the following challenge: a system that consists of many interdependent scripts, functions and playbooks is easy to build, but hard to operate. In systems where the workflow execution state is implicitly encoded in the functions and scripts that call each other, it is hard to determine how far a workflow has progressed, where it stopped or how its overall health is trending.

Modifying and augmenting existing workflows in such systems is hard. Only specialists with deep knowledge of the code and system behavior can modify them, which often reduces a team of specialists down to the one person who implemented the code in the first place. If that person is no longer around, the alternatives are “don’t touch” or “full re-write”.

Difficult tasks made simple

We address this challenge by providing a cloud-native infrastructure workflow engine that manages execution state and related statistics and provides events and triggers. Furthermore, it gives full visibility into the data being exchanged between tasks in a workflow. Think of it as an execution environment and real-time debugger for your infrastructure. Workflows can be started via REST APIs or via the GUI and are executed in a fully decentralized way.

We use “workers” that implement the execution logic by polling the workflow engine. Those workers can be scaled up and down based on workload needs. The persistence layer uses Redis for state and Elasticsearch for statistics and logs. This approach allows users to run workflows in parallel, scaled by the number of workers, with full visibility into execution state and statistics.
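As a minimal sketch of what such a worker might look like in Python: it polls the engine for pending tasks of its type, runs the execution logic and reports the result back. The endpoint paths, task type and field names below are assumptions for illustration; the actual API of your workflow engine may differ.

import time
import requests

ENGINE_URL = "http://workflow-engine:8080/api"   # hypothetical engine address
TASK_TYPE = "configure_bgp_neighbor"             # hypothetical task type this worker handles

def execute(task_input):
    # Placeholder for the actual execution logic (call UniConfig, Ansible, IPAM, ...)
    return {"status": "COMPLETED", "output": {"configured": True, **task_input}}

while True:
    # Ask the engine for the next pending task of our type
    resp = requests.get(f"{ENGINE_URL}/tasks/poll/{TASK_TYPE}")
    if resp.status_code == 200 and resp.text:
        task = resp.json()
        result = execute(task.get("inputData", {}))
        # Report the result back; the engine persists state and statistics
        requests.post(f"{ENGINE_URL}/tasks", json={
            "taskId": task["taskId"],
            "status": result["status"],
            "outputData": result["output"],
        })
    else:
        time.sleep(1)  # nothing to do, back off briefly

Because all state lives in the engine, running more copies of this loop is all it takes to scale out a given task type.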

Everyone in the team can contribute

Having the ability to write simple tasks, i.e. functions that implement the execution logic, and to string them together into a workflow opens this system up to personnel who previously did not have the expertise to deal with model-based systems.

Anyone who is capable of writing code (e.g. Python) can contribute to these workflows, while interacting with one logical function, the workflow engine, which manages the administration and execution of all tasks. This approach enables new features and capabilities to move from development and test to staging and production in the shortest amount of time possible.
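For instance, a task can be as simple as a plain Python function that receives input produced by preceding tasks and returns output for the tasks that follow. The function and field names below are illustrative, not part of any specific API.

def allocate_vlan(task_input):
    """Example task: pick a free VLAN ID for the new service (illustrative logic only)."""
    used_vlans = set(task_input.get("used_vlans", []))
    # Choose the first free VLAN in an assumed 1000-1999 service range
    vlan_id = next(v for v in range(1000, 2000) if v not in used_vlans)
    # Whatever is returned here becomes available to subsequent tasks in the workflow
    return {"vlan_id": vlan_id}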

A single workflow for cloud and on-prem assets

Here is an example where we have combined two tasks that are often handled by separate teams into a single workflow. Infrastructure engineers and network admins need to create assets in the cloud and need to configure on-prem assets to connect with those cloud assets.

We have created a workflow that uses Terraform to create a VPC in AWS and that configures on-prem networking equipment with the correct IP addresses, VLAN information and BGP configuration to establish the direct connection to AWS. The result is that infrastructure engineers can provide a single, well-defined API to northbound systems that activate the service. Examples of such northbound systems are ServiceNow, Salesforce, jBPM-based systems or any business process management system with a REST interface.

Let’s have a look

Here is the graph of our workflow before we start the execution. The graph is defined in JSON and can be customized and augmented by users as needed.

Before we execute the workflow, we need to provide the necessary input parameters, either via the GUI, as shown below, or programmatically via a REST call.
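A programmatic start could look roughly like the sketch below. The endpoint path, workflow name and input fields are assumptions for illustration; use the names from your own deployment.

import requests

ENGINE_URL = "http://workflow-engine:8080/api"       # hypothetical engine address

# Input parameters for the workflow; field names are illustrative
workflow_input = {
    "aws_region": "eu-west-1",
    "vpc_cidr": "10.30.32.0/24",
    "vlan_id": 1111,
    "device_id": "leaf-1",
}

# Start the "create_vpc_and_connect" workflow (name assumed for this example)
resp = requests.post(f"{ENGINE_URL}/workflow/create_vpc_and_connect", json=workflow_input)
workflow_id = resp.text  # the engine returns an ID we can use to track execution
print(f"Started workflow {workflow_id}")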

After we have provided all parameters and have started the execution, we can monitor the progress of the workflow by seeing its color change from orange (Running) to green (Completed) or to red (Failure).

After the first two tasks have completed, we see that a new VPC and NAT Gateway were created in our AWS data center.

Terraform provides us with information from AWS. The second task in our workflow, “Terraform apply”, provides output variables that can be used by the following tasks. Here we receive the public IP address, VLAN and other information that we can use for our network device configuration.

In the workflow definition we see that we use a variable from the output of the “Terraform apply” task to configure the BGP neighbor on the network device.
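The wiring between tasks might look roughly as follows. The JSON definition is shown here as Python dictionaries for readability; the task names, variable names and reference syntax are assumptions that depend on the workflow engine in use.

# Two tasks from the workflow definition, expressed as Python dictionaries.
# The BGP task consumes output produced by the preceding "terraform_apply" task.
terraform_apply_task = {
    "name": "terraform_apply",
    "taskReferenceName": "terraform_apply",
    "inputParameters": {"plan_dir": "${workflow.input.plan_dir}"},
}

configure_bgp_task = {
    "name": "configure_bgp_neighbor",
    "taskReferenceName": "configure_bgp_neighbor",
    "inputParameters": {
        # Reference the Terraform output of the previous task
        "neighbor_ip": "${terraform_apply.output.peer_ip}",
        "vlan_id": "${terraform_apply.output.vlan_id}",
        "remote_as": "${workflow.input.remote_as}",
    },
}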

In the next steps we mount a network device and configure it to provide connectivity to the AWS VPC. For demonstration purposes, we use two techniques for device configuration. One method uses templates with a parameter dictionary; the second method uses the FRINX UniConfig APIs with OpenConfig semantics and full transactionality.

A task that implements device configuration via a template is shown below. A template (“template”) with variables and a dictionary of parameters (“params”), either static or dynamically obtained from other tasks in the workflow, are passed to the task; the parameters are substituted into the template and the resulting configuration is executed on the device. All text is escaped so it can be handled in JSON.
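A sketch of what such a task input might contain is shown below. The field names and template text are illustrative; in the actual workflow definition the template is an escaped JSON string.

# Illustrative task input: a CLI template plus the parameters substituted into it
template_task_input = {
    "template": (
        "configure terminal\n"
        "interface Vlan{vlan_id}\n"
        " description routed vlan {vlan_id} interface for vpc {vpc_id}\n"
        " ip address {ip_address}\n"
        " no shutdown\n"
        "end\n"
    ),
    "params": {
        "vlan_id": 1111,                      # static parameter
        "vpc_id": "vpc-123412341234",         # taken from the Terraform output
        "ip_address": "10.30.32.54/31",       # taken from the Terraform output
    },
}

# Rendering the template with the parameters produces the CLI that is sent to the device
rendered = template_task_input["template"].format(**template_task_input["params"])
print(rendered)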

The second method of applying configuration takes advantage of the UniConfig API. It starts with a “sync-from-network” call to obtain the latest configuration state from the network device that is to be configured. The next step is to load our intent, the BGP configuration, into the UniConfig datastore. Finally, we issue a commit to apply the configuration to the network device. If this step fails, UniConfig rolls the configuration back and restores the device to its state before the configuration attempt. Transactions can be performed on a single device or across multiple devices.
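A rough sketch of this sequence as REST calls is shown below. The base URL, data paths and payload structure are assumptions that differ between UniConfig versions; only the overall pattern of sync, write intent and commit is meant to be illustrative.

import requests

UNICONFIG_URL = "http://uniconfig:8181/rests"     # assumed base URL; depends on deployment
AUTH = ("admin", "admin")                         # placeholder credentials
DEVICE = "leaf-1"                                 # illustrative device name

# 1. Sync the latest configuration state from the device into UniConfig
requests.post(f"{UNICONFIG_URL}/operations/uniconfig-manager:sync-from-network",
              json={"input": {"target-nodes": {"node": [DEVICE]}}}, auth=AUTH)

# 2. Load our intent (the BGP configuration, OpenConfig-style) into the datastore;
#    the exact path under the uniconfig topology varies with the UniConfig version
bgp_intent = {
    "bgp": {"global": {"config": {"as": 65000}},
            "neighbors": {"neighbor": [{"neighbor-address": "63.33.91.252",
                                        "config": {"peer-as": 65071}}]}}
}
requests.put(f"{UNICONFIG_URL}/data/network-topology:network-topology/"
             f"topology=uniconfig/node={DEVICE}/configuration/network-instances/"
             f"network-instance=default/protocols/protocol=BGP,default/bgp",
             json=bgp_intent, auth=AUTH)

# 3. Commit; on failure UniConfig rolls the device back to its previous state
requests.post(f"{UNICONFIG_URL}/operations/uniconfig-manager:commit",
              json={"input": {"target-nodes": {"node": [DEVICE]}}}, auth=AUTH)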

Finally, by examining the task output of the “read journal” task, we can see the journal that reflects all information that has been sent to the device.

Below you can find the unescaped version of the journal content:

2018-11-21T16:46:59.919: show running-config
2018-11-21T16:47:02.478: show history all | include Configured from
2018-11-21T16:47:02.628: show running-config
2018-11-21T16:47:08.85: configure terminal
vlan 1111
!
vlan configuration 1111
!
end
2018-11-21T16:47:11.516: configure terminal
interface Vlan1111
description routed vlan 1111 interface for vpc vpc-123412341234
no shutdown
mtu 9000
no bfd echo
no ip redirects
ip address 10.30.32.54/31
ip unreachables
!
end
2018-11-21T16:47:15.927: configure terminal
interface Ethernet1/3
switchport trunk allowed vlan add 1111
!
end
2018-11-21T16:47:18.475: show history all | include Configured from
2018-11-21T16:47:18.516: show running-config
2018-11-21T16:47:34.545: configure terminal
router bgp 65000
neighbor 63.33.91.252 remote-as 65071
neighbor 63.33.91.252 description ^:tier2:bgp:int:uplink:vlan::abc:eth1/1:1111:trusted:abcdef
neighbor 63.33.91.252 update-source Vlan1111
neighbor 63.33.91.252 route-map DEFAULT_ONLY out
address-family ipv4
neighbor 63.33.91.252 activate
exit
end
2018-11-21T16:47:34.974: show history all | include Configured from

This concludes the workflow. In the same way, we can design other workflows to change or clean up all assets in case the connection is decommissioned.

 

About FRINX

FRINX was founded in 2016 in Bratislava and consists of a team of passionate developers and industry professionals who want to change the way networking software is developed, deployed and used. FRINX offers distributions of OpenDaylight and FD.io in conjunction with support services and is proud to have service provider and enterprise customers from the Fortune Global 500 list.