Notes on Linux namespaces and related things

Some notes about Linux namespaces and cgroups, based on the resources linked at the end of the page.

ip netns show and proc/

A quick search on ‘Linux namespaces’ usually turns up examples using ip netns, which can be confusing when ip netns wasn’t the tool used to create the network namespaces, e.g. in the case of Docker (or Mininet). Namely, ip netns show will give you nothing, even when you clearly have things running in namespaces.

Referring to proc(5), information about a process’s namespaces and cgroups can be found under /proc/[pid]/ns and /proc/[pid]/cgroup, respectively. When ip netns creates a namespace, it bind-mounts the namespace’s ns file (/proc/self/ns/net) under /run/netns, which is where ip netns show gets the information that it lists. Hence, if ip netns wasn’t used, there are no bind mounts, and nothing to list.
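
As an aside, ip netns can be made aware of such a namespace by recreating the bind mount (or a symlink to the ns file) by hand. A rough sketch, using a hypothetical container PID and an arbitrary name of my own choosing:

# mkdir -p /run/netns
# ln -s /proc/22772/ns/net /run/netns/mycontainer    # name is arbitrary
# ip netns show                                      # now lists 'mycontainer'
# ip netns exec mycontainer ip addr                  # run commands in that namespace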

ps(1)

In a related vein, given a process, it’s also possible to see some information about the namespaces and cgroups that it is associated with. Certain dialects of ps will let you display this information:

$ ps -eo pid,cgroup,netns,pidns
 6017 2:name=systemd:/user/1000.u 4026531956 4026531836
 6018 2:name=systemd:/user/1000.u 4026531956 4026531836
22772 11:hugetlb:/docker/505f3032          -          -
22843 11:hugetlb:/docker/a3685093          -          -

The options here would be:

  • cgroup – display control groups to which the process belongs.
  • ipcns – inode number describing the (IPC) namespace the process belongs to.
  • mntns – inode number describing the (mount) namespace the process belongs to.
  • netns – inode number describing the (network) namespace the process belongs to.
  • pidns – inode number describing the (PID) namespace the process belongs to.
  • userns – inode number describing the (User) namespace the process belongs to.
  • utsns – inode number describing the (UTS) namespace the process belongs to.

More information about each namespace type can be found in namespaces(7), although that man page isn’t guaranteed to be shipped. Both the ps and the man page above come with Ubuntu 14.04, but Ubuntu’s online man pages seem to mysteriously omit the latter. Links to both are provided below.

Looking through proc/

Even without ps, the same information can be gleaned from /proc/[pid]/. The inode numbers of a process’s namespaces can be read from /proc/[pid]/task/[pid]/ns/:

$ readlink /proc/25414/task/25414/ns/*                  
ipc:[4026531839]
mnt:[4026531840]
net:[4026531956]
pid:[4026531836]
user:[4026531837]
uts:[4026531838]
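
These inode links are also what nsenter(1) from util-linux uses to enter another process’s namespaces. For example, a quick sketch of running a command inside this process’s network namespace:

# nsenter --target 25414 --net ip link    # list interfaces as seen by PID 25414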

The cgroup(s) can be read from /proc/[pid]/task/[pid]/cgroup:

$ cat /proc/25414/task/25414/cgroup
11:hugetlb:/
10:perf_event:/
9:blkio:/
8:freezer:/
7:devices:/
6:memory:/
5:cpuacct:/
4:cpu:/
3:cpuset:/
2:name=systemd:/user/1000.user/17.session
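
Since two processes share a namespace exactly when these inode numbers match, comparing the readlink output gives a quick check. A small sketch, reusing PIDs from the ps listing above:

$ a=$(readlink /proc/6017/ns/net)
$ b=$(readlink /proc/22772/ns/net)   # reading another user's ns links may need root
$ [ "$a" = "$b" ] && echo "same netns" || echo "different netns"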

References:

Adding a new Switch class to Mininet.

This is a quick post about making Mininet support new network objects, like switches.

Depending on the number of features that a software switch (or other program) supports, and the degree to which you want to access those features through Mininet, adding a custom node type to Mininet can be a relatively simple task. For something as simple as a classic learning switch, this may be a matter of implementing just a handful of methods within a child class of Switch.

An example: the LinuxBridge class

A good example of such a custom node is the LinuxBridge, prepackaged with Mininet and found in nodelib.py. This node, which implements a non-OpenFlow learning switch using the classic Linux bridge(8) device, implements just five methods, not counting its constructor:

  • connected( self ):
    Returns True if the switch is connected to a controller or False if not. For a non-SDN switch, the connected state is when it is fully started up and therefore ready for interaction. For the LinuxBridge, this is when brctl(8) shows that it is in the “forwarding” state.

  • start( self, _controllers ):
    Initializes and configures the software switch underlying the Mininet Switch object. For our example, this method invokes a series of brctl commands to initialize the switch, add ports, and apply the necessary configuration (e.g. enabling STP) to get it to the connected state. An OpenFlow switch such as OVS would also be configured to connect to the set of controllers passed into this method.

  • stop( self, deleteIntfs=True ):
    Carries out the teardown and removal of the underlying software switch backing the Mininet object. This method is the complement to start.

  • dpctl( self, *args ):
    Invokes the configuration utility for the software switch, parameterized to point to the particular instance associated with the Mininet object. In our case, this is, again, brctl.

  • setup( cls ):
    Checks that all prerequisites are met in order to start up and configure the software switch. This may entail checking that certain kernel modules are loaded, packages like bridge-utils are installed, and certain features, such as the firewall, are configured to allow the switch to function.

One detail to take note of is that the Switch class does its own housekeeping in its __init__ and stop methods, so these should be called from your custom class as well:

class LinuxBridge( Switch ):

    def __init__( self, name, stp=False, prio=None, **kwargs ):
        # snipped out initialization things here
        Switch.__init__( self, name, **kwargs )

    # snip...

    def stop( self, deleteIntfs=True ):
        # snipped out cleanup things here
        super( LinuxBridge, self ).stop( deleteIntfs )

Specifically, __init__ handles the initialization of the lightweight virtualization facilities and the machinery for sending commands to the process(es) underlying the switch object, and stop handles the teardown and cleanup of the switch’s interfaces.

This meant that, by implementing the above methods in an IfBridge class, I was able to quickly put together a learning switch using if_bridge(4) and ifconfig(8) for my experimental Mininet port – barring one easily circumvented but odd bit of behavior. As a convention, custom nodes usually end up in nodelib.py, so that is where I added my IfBridge node.

Integrating a custom element into `mn`

mn allows you to launch a Mininet topology from the shell, and is one of the first things that the Mininet quickstart has you run to sanity check an install. It’s also handy for launching a topology to test with, without having to write a Python script. Among the options that mn comes with is the class of various network elements (switches, links, etc) to use when building the topology:

$ mn --help
Usage: mn [options]
(type mn -h for details)

The mn utility creates Mininet network from the command line. It can create
parametrized topologies, invoke the Mininet CLI, and run tests.

Options:
  -h, --help            show this help message and exit
  --switch=SWITCH       default|ivs|lxbr|ovs|ovsbr|ovsk|user[,param=value...]
                        ovs=OVSSwitch default=OVSSwitch ovsk=OVSSwitch
                        lxbr=LinuxBridge user=UserSwitch ivs=IVSSwitch
                        ovsbr=OVSBridge
  --host=HOST           cfs|proc|rt[,param=value...]
                        rt=CPULimitedHost{'sched': 'rt'} proc=Host
                        cfs=CPULimitedHost{'sched': 'cfs'}
...

Adding your custom Switch to this list is a matter of a few changes in mn. mn keeps a mapping between the various Node and Link classes and their aliases (e.g. default for OVS, lxbr for LinuxBridge …). So in my case, I have my IfBridge custom switch, which I aliased to ‘ifbr’ in the SWITCHES map:

SWITCHES = { 'user': UserSwitch,
             'ovs': OVSSwitch,
             ...
             'ifbr': IfBridge }

At this point, your custom class should be displayed alongside the other choices in the help message:

  --switch=SWITCH       default|ifbr|ivs|lxbr|ovs|ovsbr|ovsk|user[,param=value
                        ...] ovs=OVSSwitch lxbr=LinuxBridge user=UserSwitch
                        ivs=IVSSwitch default=OVSSwitch ovsk=OVSSwitch
                        ovsbr=OVSBridge ifbr=IfBridge

The rest of the script more or less sanity-checks the combination of classes chosen during invocation, before launching a Mininet object. The sanity checks done for each network element type depend on its features. For example, if --switch=default is specified, mn checks for an OpenFlow controller, since the OVS switch would need a controller to connect to. If a non-functional dummy controller (--controller=none) was used in the invocation, it will fall back to using bridge-mode OVS (ovsbr) even if ‘default’ was chosen for the switch.

In the case of my custom ‘ifbr’ switch, since it is a good-old-fashioned learning switch, the only extra work I needed to do was to add it to the list of non-OpenFlow switch types that are available for use with the ‘none’ controller:

    def begin( self ):
        "Create and run mininet."
        # snip...
        if not self.options.controller:
            if not CONTROLLERS[ 'default' ]:
            # snip...
                elif self.options.switch not in ( 'ovsbr', 'lxbr', 'ifbr' ):
                    raise Exception( "Could not find a default controller "
                                     "for switch %s" %
                                     self.options.switch )

At this point, I can run mn with my custom switch like so:

sudo ./mn --switch=ifbr --controller=none --test=pingall

VIMAGE Jails, renamed bridge devices, and errors.

This is something that I noticed on FreeBSD 10.3 —

While putting together a Mininet Switch class that uses the if_bridge network bridge device, I noticed that there was some strange behavior: I would create a bridge within a jail, rename it, and when it came time to destroy it, I would receive an error:

ifconfig: SIOCIFDESTROY: Invalid argument

(If the ifconfig makes you wonder – Mininet Switch objects are merely references to shells started in a network namespace, or in my case, a jail, so they work as glorified shell scripts that configure a software switch of your choice.)

The bridge would then get destroyed along with the jail when Mininet runs jail -r, at which point its name becomes “lost”. Trying to create another bridge with the same original name afterwards would fail:

# ifconfig bridge2 create
ifconfig: SIOCIFCREATE2: File exists

Manually recreating the steps that Mininet runs through, using jexec, let me reproduce this odd behavior:

# jexec 26 ifconfig bridge create
bridge2
# jexec 26 ifconfig bridge2 | grep bridge
bridge2: flags=8802 metric 0 mtu 1500
# jexec 26 ifconfig bridge2 name test2 
# jexec 26 ifconfig test2 | grep test
test2: flags=8802 metric 0 mtu 1500
# jexec 26 ifconfig test2 destroy
ifconfig: SIOCIFDESTROY: Invalid argument

Playing around a bit more revealed that renaming it to any name of the original format (bridge[n], with n a number) would allow the bridge to be destroyed under that newest name:

# jexec 26 ifconfig test2 name bridge100
# jexec 26 ifconfig -a | grep bridge
bridge2: flags=8802 metric 0 mtu 1500
# jexec 26 ifconfig bridge100 destroy
# jexec 26 ifconfig -a | grep bridge
#

Another option is to pull the renamed bridge from the jail, and to destroy it in the host environment.
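
For reference, that route would look something like the following, run from the host with the same jail ID as before:

# ifconfig test2 -vnet 26
# ifconfig test2 destroy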

The workaround in my case was simply to not rename the bridge device to match the name of the Switch object – this also had the added benefit of reducing the number of ifconfig invocations.

Now, the correct solution would probably be to file a bug report, or to dig around myself…


UPDATE: The behavior seems to have been fixed in 11.0.

epairs and duplicate address detection (DAD) warnings.

After the initial step of determining the set of commands needed to bring up a VIMAGE/jail network (described here), I started reorganizing it to more closely mimic the order in which these commands would be called by a Mininet script. The typical order of operations and their corresponding commands are roughly:

  1. Instantiate a topology (Topo) object: (no corresponding step)
  2. Add Switches and Hosts to Topo object: Start up some jails with bridges, if they are switches, and shells, if they’re hosts
  3. Interconnect the Switches and Hosts by adding Links to the Topo object: create epairs, and move the interfaces to the jails
  4. Initialize the network with the Topo object as its topology: Bring the interfaces up, and if the jails represent switches, add the interface to the bridge

But, while trying to recreate the same linear,2 network, I noticed the following messages in dmesg:

epair2b: DAD detected duplicate IPv6 address fe80:2::ff:70ff:fe00:40b: NS in/out/loopback=4/1/0, NA in=0
epair2b: DAD complete for fe80:2::ff:70ff:fe00:40b - duplicate found
epair2b: manual intervention required
epair2b: possible hardware address duplication detected, disable IPv6
epair3b: DAD detected duplicate IPv6 address fe80:2::ff:70ff:fe00:40b: NS in/out/loopback=4/1/0, NA in=0
epair3b: DAD complete for fe80:2::ff:70ff:fe00:40b - duplicate found
epair3b: manual intervention required
epair3b: possible hardware address duplication detected, disable IPv6

And a bit above, the following:

epair2a: Ethernet address: 02:ff:20:00:03:0a
epair2b: Ethernet address: 02:ff:70:00:04:0b
epair2a: link state changed to UP
epair2b: link state changed to UP
epair3a: Ethernet address: 02:ff:20:00:03:0a
epair3b: Ethernet address: 02:ff:70:00:04:0b
epair3a: link state changed to UP
epair3b: link state changed to UP

As the DAD (Duplicate Address Detection) warnings suggested, the same MAC (hardware) addresses were indeed being reused for the epair* interfaces being created.

Searching for the warnings eventually brought me to a thread describing the exact mechanics behind the issue – I had interleaved the steps for creating the epairs and moving them to the jails. The MAC addresses for epair* interfaces are generated from a globally tracked if_index counter. This value increases by one for each interface created (so +2 for each ifconfig epair create), and decreases by one for each interface destroyed or moved to a VIMAGE jail. The problem arises when epair creation is interleaved with moving them to jails; the if_index:

  1. increases by two when the first epair is created,
  2. drops back down by two when its two interfaces are moved, and
  3. takes on the same values for the next epair created as it did for the first

In fact, since the value of if_index is used directly in the 4th and 5th byte of the MAC address (first and last are hard-coded and the rest, set by other means), we can see in the dmesg output that the index values 3 and 4 are being reused repeatedly.

It also explains why I didn’t see this issue initially: I was creating all of the epairs at once and moving them later on, so the changes in if_index were monotonic. While I could reorganize the commands so that they both follow Mininet’s conventions and keep if_index from fluctuating, I manually assigned unique addresses to each epair for the time being:

# ifconfig epair1a ether 02:ff:00:00:01:14    #from s1 (jid 1) to h1 (jid 4)
# ifconfig epair1b ether 02:ff:00:00:01:41    #from h1 (jid 4) to s1 (jid 1)
...

Of course, I’ll use a less manual approach for generating a unique MAC address in Mininet.

[update]: if_index is not guaranteed to monotonically increase, but the number in the interface name (the ‘n’ in epair[n]) does, so I decided to use that as a base for my unique MACs.
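
As a sketch of what that generation could look like in shell (the 02:ff prefix and byte layout are my own arbitrary choices, not a scheme from Mininet):

# derive a locally administered MAC from the epair number and side
# usage: epair_mac <n> <a|b>; only handles n up to 255
epair_mac() {
    if [ "$2" = "a" ]; then side=0a; else side=0b; fi
    printf '02:ff:00:00:%02x:%s\n' "$1" "$side"
}

ifconfig epair1a ether "$(epair_mac 1 a)"
ifconfig epair1b ether "$(epair_mac 1 b)"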


References:

Creating networks with VIMAGE jails and epairs.

This is part of a series of notes on the experimental process of getting Mininet to run on FreeBSD.

The first step is to identify the components and commands that are required to implement the basic features. For an emulator like Mininet, this would be 1) the ability to build custom network topologies, and 2) the ability to interact with the topology by sending traffic across it, and monitoring the traffic flowing through the network.

Custom topologies

Mininet allows users to build custom network topologies by interconnecting node and link Mininet objects. Here, jails with VIMAGE replace the mount and network namespaces used to implement the nodes, and epairs replace the veth virtual Ethernet pairs implementing the links.

This link provides clear instructions for getting VIMAGE up and running for a simple topology, making it a good place to start. At the time this post was written, VIMAGE wasn’t enabled by default in the stable release (10.2) and required a custom kernel.
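
For completeness, enabling it roughly amounts to building a kernel with the VIMAGE option, e.g. with a config like /usr/src/sys/amd64/conf/VIMAGE containing:

include GENERIC
ident   VIMAGE
options VIMAGE

followed by the usual build, install, and reboot:

# make -C /usr/src buildkernel KERNCONF=VIMAGE
# make -C /usr/src installkernel KERNCONF=VIMAGE
# shutdown -r now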

Since the initial (and primary) focus at this time is in building custom topologies, the jails aren’t given their own directory trees, and their paths are set to /.

Handling traffic

In addition to being able to build out topologies, Mininet also allows users to interact with their networks with tools such as ping, traceroute, and tcpdump, which require creating raw sockets from within the jails. This can be enabled by setting security.jail.allow_raw_sockets to 1, or by passing allow.raw_sockets as a parameter to the jail utility when creating the jails.
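
Concretely, either of the following does the trick (the jail parameters mirror those used below):

# sysctl security.jail.allow_raw_sockets=1                  # host-wide
# jail -c vnet name=h1 path=/ allow.raw_sockets persist     # per-jail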

Finally, the jails that represent network nodes (e.g. switches and routers, as opposed to end hosts) need some mechanism to move traffic. In Mininet, this would typically be an OpenFlow-programmable software switch such as Open vSwitch or the CPqD software switch. Although the former is available in the ports collection, the if_bridge network device will be used for the time being to reduce the number of moving parts and narrow down the core set of commands needed to bring up a topology capable of carrying traffic.

Manual topology construction

The following identifies the steps and commands required to manually construct what Mininet calls a linear,2 topology:

s1---s2
|    |
h1   h2

where h1 and h2 represent hosts on the network, and s1 and s2, the network nodes (switches).

  1. Prepare the host. After enabling VIMAGE in the kernel:
    # kldload if_bridge
    # sysctl security.jail.allow_raw_sockets=1
  2. Create jails. Since allow_raw_sockets was set in the host, there is no need to pass allow.raw_sockets to jail.
    # jail -c vnet name=s1 jid=1 path=/ persist
    # jail -c vnet name=s2 jid=2 path=/ persist
    # jail -c vnet name=h1 jid=3 path=/ persist
    # jail -c vnet name=h2 jid=4 path=/ persist
    

    jls should now show your jails (jls -v will show you more, including the assigned names):

    # jls
       JID  IP Address      Hostname                      Path
         1  -                                             /
         2  -                                             /
         3  -                                             /
         4  -                                             /
  3. Create bridges in the ‘network node’ jails (JIDs 1 and 2)
    # jexec s1 ifconfig bridge1 create up
    # jexec s2 ifconfig bridge2 create up
  4. Create virtual Ethernet links (epairs) and interconnect the jails
    # ifconfig epair1 create      # link s1 <-> h1
    # ifconfig epair2 create      # link s2 <-> h2
    # ifconfig epair3 create      # link s1 <-> s2
    # ifconfig epair1a vnet s1
    # ifconfig epair1b vnet h1
    # ifconfig epair2a vnet s2
    # ifconfig epair2b vnet h2
    # ifconfig epair3a vnet s1
    # ifconfig epair3b vnet s2
  5. Add epair interfaces to each bridge and bring them up
    # jexec s1 ifconfig bridge1 addm epair1a addm epair3a
    # jexec s1 ifconfig epair1a up
    # jexec s1 ifconfig epair3a up
    # jexec s2 ifconfig bridge2 addm epair2a addm epair3b
    # jexec s2 ifconfig epair2a up
    # jexec s2 ifconfig epair3b up
  6. Configure IP addresses for ‘host’ jail interfaces
    # jexec h1 ifconfig epair1b 10.0.0.1 up
    # jexec h2 ifconfig epair2b 10.0.0.2 up

Sanity-checking the topology

It should now be possible to ping from one host to another:

# jexec h1 ping 10.0.0.2
PING 10.0.0.2 (10.0.0.2): 56 data bytes
64 bytes from 10.0.0.2: icmp_seq=0 ttl=64 time=0.046 ms
...
^C
--- 10.0.0.2 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.046/0.052/0.055/0.004 ms

It should also be possible to monitor the traffic passing through a network node (e.g. by running tcpdump) while the hosts are pinging one another.
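
For example, with h1 pinging as above, the traffic can be watched from s1’s side of the link:

# jexec s1 tcpdump -ni epair1a icmp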

Teardown

Once a topology is no longer needed, it should be torn down and the virtual links and jails destroyed.

  1. Reclaim the epairs from the jails and destroy them (destroying one end of an epair destroys both endpoints)
    # ifconfig epair1a -vnet s1
    # ifconfig epair2a -vnet s2
    # ifconfig epair1a destroy
    # ifconfig epair2a destroy
  2. Destroy the bridges
    # jexec s1 ifconfig bridge1 destroy
    # jexec s2 ifconfig bridge2 destroy
  3. Destroy jails
    # jail -r s1
    # jail -r s2
    # jail -r h1
    # jail -r h2

The idea is that the commands (and procedures) that have been identified here can be retrofitted into Mininet.

[to be continued]

Changing pager settings for Git.

It seems that on FreeBSD, the default pager is more. For Git commands involving the pager, this has the effect of displaying ANSI color escape sequences as ‘ESC[ …’ rather than coloring the text:

$ git diff
ESC[1mdiff --git a/mininet/link.py b/mininet/link.pyESC[m
ESC[1mindex 9703ce7..559b5da 100644ESC[m
ESC[1m--- a/mininet/link.pyESC[m
ESC[1m+++ b/mininet/link.pyESC[m
ESC[36m@@ -25,7 +25,7 @@ESC[m
 """ESC[m
 ESC[m
 from mininet.log import info, error, debugESC[m
ESC[31m-from mininet.util import makeIntfPairESC[m
ESC[32m+ESC[mESC[32mfrom mininet.util import makeIntfPair, quietRun
...

A quick search of the man pages for more (which actually leads to less(1)) shows that the -R flag would allow the raw (ANSI) control characters to be displayed properly:

       -R or --RAW-CONTROL-CHARS
              Like  -r,  but  only ANSI "color" escape sequences are output in
              "raw" form.  Unlike -r, the screen appearance is maintained cor-
              rectly  in  most  cases.   ANSI  "color"  escape  sequences  are
              sequences of the form:

                   ESC [ ... m

Setting the PAGER environment variable to ‘more -R’ is one solution, but a way to affect only Git’s behavior is to set its configuration using git config:

$ git config --global core.pager 'more -R'
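
This writes the setting to ~/.gitconfig, which should end up containing something like:

[core]
        pager = more -R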

References:

Notes on unionfs and nullfs.

unionfs is a type of filesystem that allows you to combine two directory trees into one, where the contents of both are visible from the mount point of the filesystem. The filesystem can be mounted either with mount or mount_unionfs:

mount -t unionfs [directory] [mountpoint]
mount_unionfs [directory] [mountpoint]

Running either of the above will union mount [directory] onto [mountpoint], making the contents of both visible in the latter. Important to note is that [directory] becomes the upper layer, and [mountpoint], the lower layer. The upper layer is essentially where the changes made at the [mountpoint] persist after umount. Adding -o below makes [directory] the lower layer and [mountpoint] the upper layer.
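
That is, to invert the default layering:

mount -t unionfs -o below [directory] [mountpoint]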

For files with the same name in both layers, the upper layer file is visible from [mountpoint]. This may be an issue if you still need to access the lower layer file (i.e. if your lower layer is [mountpoint], as in the default behavior of a unionfs mount). One way to keep access to the lower layer files is to use the nullfs loopback filesystem to duplicate the mountpoint’s tree elsewhere before employing unionfs:

mount -t nullfs [unionfs mountpoint] [copy mountpoint] # or use mount_nullfs
mount -t unionfs [directory] [unionfs mountpoint]

The above will let you access just the original files in [unionfs mountpoint] from [copy mountpoint]. Additionally, whatever changes you make in [copy mountpoint] will take effect and persist in [directory].

Another usage model might look like this:

mount -t nullfs [shared directory] [copy mountpoint]
mount -t unionfs [directory] [copy mountpoint]

In this case, if the two commands are repeated for different [directory]’s and [copy mountpoint]’s, each [directory] will have access to identical copies of [shared directory]. Having access to multiple identical copies of the same directory is useful when, for example, setting up the ports tree for multiple jails in a space-efficient way.

This contrasts with the previous case, where each [directory] will have access to the incremental results of previous unionfs invocations.
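
As a concrete sketch of the second model (paths here are hypothetical), two jails sharing one ports tree:

mount -t nullfs /usr/ports /jail1/usr/ports
mount -t unionfs /jail1/ports-overlay /jail1/usr/ports
mount -t nullfs /usr/ports /jail2/usr/ports
mount -t unionfs /jail2/ports-overlay /jail2/usr/ports

Each jail sees the full tree, while its changes land in its own overlay directory.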

As a side-note, nullfs’s functions might be compared to using the --bind option for mount on Linux 2.4.0 and later.


References: