Introduction
In this short post I explain what YAML aliases and anchors are. I then show you how to stop PyYAML from using them when serializing data structures.
While references are absolutely fine in YAML files meant for programmatic consumption, I find that they sometimes confuse humans, especially those who have never seen them before. For this reason I tend to disable anchors and aliases when saving data to YAML files meant for human consumption.
Contents
- YAML aliases and anchors
- When does YAML use anchors and aliases
- YAML dump - don’t use anchors and aliases
- Conclusion
- References
- GitHub repository with resources for this post
YAML aliases and anchors
The YAML specification has a provision for preserving information about nodes pointing to the same data. This means that if some data is referenced in multiple places in your data structure, the YAML dumper will:
- add an anchor to the first occurrence
- replace any subsequent occurrences of that data with aliases
Now, what do these anchors and aliases look like?
&id001 - example of an anchor, placed with the first occurrence of the data
*id001 - example of an alias, which replaces each subsequent occurrence of the data
This might be easier to see with a concrete example. Below we have information about some interfaces. You can see that Ethernet1 has the anchor &id001 next to its properties key, while Ethernet2 has just the alias *id001.
Ethernet1:
  description: Uplink to core-1
  mtu: 9000
  properties: &id001
  - pim
  - ptp
  - lldp
  speed: 1000
Ethernet2:
  description: Uplink to core-2
  mtu: 9000
  properties: *id001
  speed: 1000
When we load this data in Python and print it we get the below:
{'Ethernet1': {'description': 'Uplink to core-1',
               'mtu': 9000,
               'properties': ['pim', 'ptp', 'lldp'],
               'speed': 1000},
 'Ethernet2': {'description': 'Uplink to core-2',
               'mtu': 9000,
               'properties': ['pim', 'ptp', 'lldp'],
               'speed': 1000}}
Anchor &id001 is gone and alias *id001 was expanded into ['pim', 'ptp', 'lldp'].
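If you’d like to reproduce this yourself, a minimal sketch along the lines below should do. The file name interfaces.yml is just an assumption I’m making for the example, and pprint is used to get the multi-line layout shown above.
from pprint import pprint

import yaml

# Load the YAML shown above; "interfaces.yml" is an assumed file name
with open("interfaces.yml") as fin:
    interfaces = yaml.safe_load(fin)

# The alias has already been expanded into a regular list at this point
pprint(interfaces)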
When does YAML use anchors and aliases
The dumper used by PyYAML can recognize Python variables and data structures that point to the same object. This often happens with deeply nested dictionaries whose keys refer to the same piece of data. A lot of APIs in the world of Network Automation use such dictionaries.
It’s worth pointing out that section 3.1.1 of the YAML spec requires anchors and aliases to be used when serializing multiple references to the same node (data object). I will show how to override this behaviour, but it’s good to know where it comes from.
I wrote two nearly identical programs that create the data structure from the beginning of this post. These will help us understand when PyYAML adds anchors and aliases.
Program #1:
# yaml_diff_ids.py
import yaml

interfaces = dict(
    Ethernet1=dict(description="Uplink to core-1", speed=1000, mtu=9000),
    Ethernet2=dict(description="Uplink to core-2", speed=1000, mtu=9000),
)

interfaces["Ethernet1"]["properties"] = ["pim", "ptp", "lldp"]
interfaces["Ethernet2"]["properties"] = ["pim", "ptp", "lldp"]

# Show IDs referenced by "properties" key
print("Ethernet1 properties object id:", id(interfaces["Ethernet1"]["properties"]))
print("Ethernet2 properties object id:", id(interfaces["Ethernet2"]["properties"]))

# Dump YAML to stdout
print("\n##### Resulting YAML:\n")
print(yaml.safe_dump(interfaces))
Program #1 output:
Ethernet1 properties object id: 41184424
Ethernet2 properties object id: 41182536
##### Resulting YAML:
Ethernet1:
  description: Uplink to core-1
  mtu: 9000
  properties:
  - pim
  - ptp
  - lldp
  speed: 1000
Ethernet2:
  description: Uplink to core-2
  mtu: 9000
  properties:
  - pim
  - ptp
  - lldp
  speed: 1000
Program #2:
# yaml_same_ids.py
import yaml

interfaces = dict(
    Ethernet1=dict(description="Uplink to core-1", speed=1000, mtu=9000),
    Ethernet2=dict(description="Uplink to core-2", speed=1000, mtu=9000),
)

prop_vals = ["pim", "ptp", "lldp"]
interfaces["Ethernet1"]["properties"] = prop_vals
interfaces["Ethernet2"]["properties"] = prop_vals

# Show IDs referenced by "properties" key
print("Ethernet1 properties object id:", id(interfaces["Ethernet1"]["properties"]))
print("Ethernet2 properties object id:", id(interfaces["Ethernet2"]["properties"]))

# Dump YAML to stdout
print("\n##### Resulting YAML:\n")
print(yaml.safe_dump(interfaces))
Program #2 output:
Ethernet1 properties object id: 13329416
Ethernet2 properties object id: 13329416
##### Resulting YAML:
Ethernet1:
  description: Uplink to core-1
  mtu: 9000
  properties: &id001
  - pim
  - ptp
  - lldp
  speed: 1000
Ethernet2:
  description: Uplink to core-2
  mtu: 9000
  properties: *id001
  speed: 1000
So, two nearly identical programs, two data structures containing identical data, but two different YAML dumps.
What caused this difference? It all comes down to a tiny change in the way we assigned values to the properties keys:
Program #1
interfaces["Ethernet1"]["properties"] = ["pim", "ptp", "lldp"]
interfaces["Ethernet2"]["properties"] = ["pim", "ptp", "lldp"]
Program #2
prop_vals = ["pim", "ptp", "lldp"]
interfaces["Ethernet1"]["properties"] = prop_vals
interfaces["Ethernet2"]["properties"] = prop_vals
In Program #1 we created two new lists and assigned each to the relevant properties key. The lists look the same but are actually two completely separate objects.
In Program #2 we first created a list and assigned it to the prop_vals variable. We then assigned prop_vals to each of the properties keys. This means that each of the keys now references the same list object.
We also asked Python to give us the IDs of the objects referenced by the properties keys. Here we can see that the IDs in Program #1 indeed differ, while they’re the same in Program #2:
Program #1 IDs:
Ethernet1 properties object id: 41184424
Ethernet2 properties object id: 41182536
Program #2 IDs:
Ethernet1 properties object id: 13329416
Ethernet2 properties object id: 13329416
And that’s it. That’s how PyYAML knows it should use aliases and anchors to represent first and subsequent references to the same object.
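A quick way to confirm this without comparing IDs by hand is Python’s is operator, which checks whether two names point to the very same object. A small sketch, reusing the interfaces dict built in Program #2:
# Identity check on the "interfaces" dict from Program #2
same_object = (
    interfaces["Ethernet1"]["properties"] is interfaces["Ethernet2"]["properties"]
)
print("Same object?", same_object)
# Program #1 would print False here; Program #2 prints True, which is
# exactly the case where PyYAML emits &id001 / *id001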
For completeness, here’s an example of loading the YAML file with references that we just dumped:
import yaml

with open("yaml_files/interfaces_same_ids.yml") as fin:
    interfaces = yaml.safe_load(fin)

# Show IDs referenced by "properties" key
print("Ethernet1 properties object id:", id(interfaces["Ethernet1"]["properties"]))
print("Ethernet2 properties object id:", id(interfaces["Ethernet2"]["properties"]))
IDs of the loaded properties keys:
Ethernet1 properties object id: 19630664
Ethernet2 properties object id: 19630664
As you can see, the IDs are the same, so the information about the properties keys referencing the same object was preserved.
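This shared reference has a practical side effect worth keeping in mind: mutating the list through one key is visible through the other. A small sketch continuing from the load example above, with "lacp" as an arbitrary value added purely for illustration:
# Both keys point to the same list object, so a change made through one
# of them shows up under the other as well
interfaces["Ethernet1"]["properties"].append("lacp")
print(interfaces["Ethernet2"]["properties"])
# ['pim', 'ptp', 'lldp', 'lacp']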
YAML dump - don’t use anchors and aliases
You now know what anchors and aliases are, what they’re used for, and where they come from. It’s now time to show you how to stop PyYAML from using them during the dump operation.
PyYAML does not have a built-in setting for disabling this default behaviour. Fortunately there are two ways in which we can prevent references from being used:
- Force all data objects to have unique IDs by using the copy.deepcopy() function
- Override the ignore_aliases() method in the PyYAML Dumper class
Method 1 might require source code modifications in multiple places and could be slow when copying large numbers of compound objects.
Method 2 only requires a few lines of code to define a custom dumper class. This class can then be used alongside the standard PyYAML dumper.
In any case, have a look at both and decide which one fits your case better.
Using copy.deepcopy() function
The Python standard library provides the copy.deepcopy() function, which returns a copy of an object, including copies of any objects found within it.
As we’ve seen already, the PyYAML serializer uses anchors and aliases when it finds references to the same object. By applying deepcopy() during object assignment we ensure that all of these objects have unique IDs. The end result? No YAML references in the final dump.
Program #2, modified to use deepcopy():
# yaml_same_ids_deep_copy.py
from copy import deepcopy

import yaml

interfaces = dict(
    Ethernet1=dict(description="Uplink to core-1", speed=1000, mtu=9000),
    Ethernet2=dict(description="Uplink to core-2", speed=1000, mtu=9000),
)

prop_vals = ["pim", "ptp", "lldp"]
interfaces["Ethernet1"]["properties"] = deepcopy(prop_vals)
interfaces["Ethernet2"]["properties"] = deepcopy(prop_vals)

# Show IDs referenced by "properties" key
print("Ethernet1 properties object id:", id(interfaces["Ethernet1"]["properties"]))
print("Ethernet2 properties object id:", id(interfaces["Ethernet2"]["properties"]))

# Dump YAML to stdout
print("\n##### Resulting YAML:\n")
print(yaml.safe_dump(interfaces))
Result:
Ethernet1 properties object id: 19775848
Ethernet2 properties object id: 19823048
##### Resulting YAML:
Ethernet1:
  description: Uplink to core-1
  mtu: 9000
  properties:
  - pim
  - ptp
  - lldp
  speed: 1000
Ethernet2:
  description: Uplink to core-2
  mtu: 9000
  properties:
  - pim
  - ptp
  - lldp
  speed: 1000
We passed prop_vals to deepcopy() during each assignment, resulting in two new copies of that data. In the output we have two different IDs even though we reused prop_vals. The final YAML representation has no references, which is exactly what we wanted.
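If changing every assignment is impractical, a variation on the same idea, which I’m suggesting here rather than taking from the original programs, is to leave the assignments from Program #2 untouched and deep-copy the whole structure once, just before dumping:
# Deep-copying the entire structure right before the dump gives every
# nested object a fresh ID, so no anchors or aliases appear in the output
from copy import deepcopy

import yaml

print(yaml.safe_dump(deepcopy(interfaces)))
The trade-off is the same as before: for very large structures the extra copy costs time and memory.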
Overriding ignore_aliases() method
To completely disable the generation of YAML references we can subclass the Dumper class and override its ignore_aliases() method.
Class definition, borrowed from Issue #103 posted on the PyYAML GitHub page:
class NoAliasDumper(yaml.SafeDumper):
    def ignore_aliases(self, data):
        return True
You could also monkey-patch the actual Dumper class but I think this solution is safer and more elegant.
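For completeness, the monkey-patching variant would look roughly like the sketch below. It changes the behaviour of every SafeDumper-based dump in the process, including yaml.safe_dump, which is the main reason I prefer the subclass:
# Monkey-patch sketch: from this point on, yaml.safe_dump (and anything
# else using yaml.SafeDumper) will skip anchors and aliases
import yaml

yaml.SafeDumper.ignore_aliases = lambda self, data: True

print(yaml.safe_dump(interfaces))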
We’ll now take NoAliasDumper and use it to modify Program #2:
# yaml_same_ids_custom_dumper.py
import yaml


class NoAliasDumper(yaml.SafeDumper):
    def ignore_aliases(self, data):
        return True


interfaces = dict(
    Ethernet1=dict(description="Uplink to core-1", speed=1000, mtu=9000),
    Ethernet2=dict(description="Uplink to core-2", speed=1000, mtu=9000),
)

prop_vals = ["pim", "ptp", "lldp"]
interfaces["Ethernet1"]["properties"] = prop_vals
interfaces["Ethernet2"]["properties"] = prop_vals

# Show IDs referenced by "properties" key
print("Ethernet1 properties object id:", id(interfaces["Ethernet1"]["properties"]))
print("Ethernet2 properties object id:", id(interfaces["Ethernet2"]["properties"]))

# Dump YAML to stdout
print("\n##### Resulting YAML:\n")
print(yaml.dump(interfaces, Dumper=NoAliasDumper))
Output:
Ethernet1 properties object id: 19455080
Ethernet2 properties object id: 19455080
##### Resulting YAML:
Ethernet1:
  description: Uplink to core-1
  mtu: 9000
  properties:
  - pim
  - ptp
  - lldp
  speed: 1000
Ethernet2:
  description: Uplink to core-2
  mtu: 9000
  properties:
  - pim
  - ptp
  - lldp
  speed: 1000
Perfect, the properties keys reference the same object but the dumped YAML no longer uses aliases and anchors. This is exactly what we needed.
Note that I replaced yaml.safe_dump with yaml.dump in the above example. This is because we need to explicitly pass our modified Dumper class. However, NoAliasDumper inherits from the yaml.SafeDumper class, so we still get the same protection we do when using yaml.safe_dump.
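If you dump in more than one place, you could wrap this in a small helper so callers don’t have to remember the Dumper argument. The helper name safe_dump_no_aliases is mine, not part of PyYAML:
# Hypothetical convenience wrapper around yaml.dump and NoAliasDumper
def safe_dump_no_aliases(data, **kwargs):
    return yaml.dump(data, Dumper=NoAliasDumper, **kwargs)

print(safe_dump_no_aliases(interfaces))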
Conclusion
This brings us to the end of the post. I hope I helped you understand what the &id001 and *id001 found in YAML files are and where they come from. You now also know how to stop PyYAML from using anchors and aliases when serializing data structures, should you ever need it.
References:
- PyYAML GitHub repository. Issue #103 Disable Aliases/Anchors: https://github.com/yaml/pyyaml/issues/103
- YAML specification. Section 3.1.1. Dump: https://yaml.org/spec/1.2/spec.html#id2762313
- YAML specification. Section 6.9.2. Node Anchors: https://yaml.org/spec/1.2/spec.html#id2785586
- YAML specification. Section 7.1. Alias Nodes: https://yaml.org/spec/1.2/spec.html#id2786196
- GitHub repo with resources for this post. https://github.com/progala/ttl255.com/tree/master/yaml/anchors-and-aliases