lxml official introductory tutorial (The Tutorial) translation
Description:
- First published: 2024-09-05
- Official tutorial link:/
- Machine-translated with Kimi and Doubao
- Please don't hesitate to point out any mistakes.
This is a tutorial on processing XML with lxml.etree. It provides a brief overview of the main concepts of the ElementTree API, as well as some simple enhancements that make your life as a programmer easier.
For a complete reference of the API, see the generated API documentation.
A common way to import lxml.etree is as follows:
from lxml import etree
If your code only uses the ElementTree API and does not rely on any functionality that is specific to lxml.etree, you can also fall back to ElementTree in the Python standard library with the following import chain:
try:
    from lxml import etree
    print("running with lxml.etree")
except ImportError:
    import xml.etree.ElementTree as etree
    print("running with Python's xml.etree.ElementTree")
To help you write portable code, this tutorial makes it clear in the examples which part of the presented API is an extension of lxml.etree over the original ElementTree API.
The Element class
An Element is the main container object of the ElementTree API. Most of the XML tree functionality is accessed through this class. Elements are easily created through the Element factory:
root = etree.Element("root")
The XML tag name of an element is accessed through its tag property:
print(root.tag)
Elements are organized in an XML tree structure. To create child elements and add them to a parent element, you can use the append() method:
root.append(etree.Element("child1"))
However, this is so common that there is a shorter and much more efficient way to do it: the SubElement factory. It accepts the same arguments as the Element factory, but additionally requires the parent element as the first argument:
child2 = etree.SubElement(root, "child2")
child3 = etree.SubElement(root, "child3")
To confirm that this is indeed XML, you can serialize the tree you created:
etree.tostring(root)
b'<root><child1/><child2/><child3/></root>'
We'll create a small helper function that prints the XML beautifully for us:
def prettyprint(element, **kwargs):
    xml = etree.tostring(element, pretty_print=True, **kwargs)
    print(xml.decode(), end='')
prettyprint(root)
<root>
  <child1/>
  <child2/>
  <child3/>
</root>
Elements are lists
To facilitate direct access to these child elements, elements mimic the behavior of normal Python lists as closely as possible:
>>> child = root[0]
>>> print(child.tag)
child1
>>> print(len(root))
3
>>> root.index(root[1])  # lxml.etree only!
1
>>> children = list(root)
>>> for child in root:
...     print(child.tag)
child1
child2
child3
>>> root.insert(0, etree.Element("child0"))
>>> start = root[:1]
>>> end = root[-1:]
>>> print(start[0].tag)
child0
>>> print(end[0].tag)
child3
Prior to ElementTree 1.3 and lxml 2.0, you could also check the truth value of an element to see if it had children, i.e. to see if the list of children was empty:
if root:   # this no longer works!
    print("The root element has children")
This is no longer supported because people tend to expect that a "something" evaluates to True, and they expect Elements to be "something" whether or not they have children. As a result, many users find it surprising that any Element would evaluate to False in an if statement like the one above. Instead, use len(element), which is both more explicit and less error-prone.
print(etree.iselement(root))  # test if it's some kind of Element
True
if len(root):                 # test if it has children
    print("The root element has children")
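The advice above can be wrapped in a tiny helper. The following is only a sketch: the name has_children is made up for this illustration, and the import chain from the beginning of this tutorial is reused so that it runs with either library.

```python
try:
    from lxml import etree
except ImportError:  # fall back to the standard library
    import xml.etree.ElementTree as etree


def has_children(element):
    """Hypothetical helper: explicit child test instead of truth-testing."""
    return len(element) > 0


root = etree.Element("root")
leaf = etree.SubElement(root, "leaf")

print(has_children(root))  # True
print(has_children(leaf))  # False
print(leaf is not None)    # True, the way to test "is there an element at all?"
```

Comparing against None and calling len() keep the two questions (does the element exist, does it have children) clearly apart.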
In another important case, the behaviour of Elements in lxml (in 2.0 and later) deviates from that of lists and from that of the original ElementTree (prior to version 1.3 or Python 2.7/3.2):
for child in root:
    print(child.tag)
child0
child1
child2
child3
root[0] = root[-1]  # this moves the element in lxml.etree!
for child in root:
    print(child.tag)
child3
child1
child2
In this example, the last element is moved to a different position instead of being copied, i.e. it is automatically removed from its previous position when it is put in a different place. In a list, objects can appear in more than one position at the same time, and the assignment above would only copy the item reference into the first position, so that both contain the exact same item:
>>> l = [0, 1, 2, 3]
>>> l[0] = l[-1]
>>> l
[3, 1, 2, 3]
Note that in the original ElementTree, a single element object could be located at any position in any number of trees, which allows the same copying operations as for lists. The obvious shortcoming is that modifications made to such an element will be applied to all locations where it appears in the tree, which may or may not be expected.
Note: in lxml, the above assignment moves the element, unlike in lists and the original ElementTree.
The upside of this difference is that an Element in lxml.etree always has exactly one parent, which can be queried through the getparent() method. This is not supported in the original ElementTree.
root is root[0].getparent()  # lxml.etree only!
If you want to copy an element to a different position in lxml.etree, consider creating an independent deep copy using the copy module from Python's standard library:
from copy import deepcopy
element = etree.Element("neu")
element.append(deepcopy(root[1]))
print(element[0].tag)
# child1
print([c.tag for c in root])
# ['child3', 'child1', 'child2']
The siblings (or neighbours) of an element are accessed as the next and previous elements:
root[0] is root[1].getprevious()  # lxml.etree only!
# True
root[1] is root[0].getnext()  # lxml.etree only!
# True
Elements carry attributes as a dict
XML elements support attributes. You can create them directly in the Element factory:
root = etree.Element("root", interesting="totally")
etree.tostring(root)
# b'<root interesting="totally"/>'
Attributes are just unordered name-value pairs, so a very convenient way of dealing with them is through the dictionary-like interface of Elements:
print(root.get("interesting"))
# totally
print(root.get("hello"))
# None
root.set("hello", "Huhu")
print(root.get("hello"))
# Huhu
etree.tostring(root)
# b'<root interesting="totally" hello="Huhu"/>'
sorted(root.keys())
# ['hello', 'interesting']
for name, value in sorted(root.items()):
    print('%s = %r' % (name, value))
# hello = 'Huhu'
# interesting = 'totally'
In cases where you want to do item lookup or have other reasons for getting a "real" dictionary-like object, e.g. for passing it around, you can use the attrib property:
>>> attributes = root.attrib
>>> print(attributes["interesting"])
totally
>>> print(attributes.get("no-such-attribute"))
None
>>> attributes["hello"] = "Guten Tag"
>>> print(attributes["hello"])
Guten Tag
>>> print(root.get("hello"))
Guten Tag
Note that attrib is a dict-like object backed by the Element itself. This means that any changes to the Element are reflected in attrib and vice versa. It also means that the XML tree stays alive in memory as long as the attrib of one of its Elements is in use. To get an independent snapshot of the attributes that does not depend on the XML tree, copy it into a dict:
d = dict(root.attrib)
sorted(d.items())
# [('hello', 'Guten Tag'), ('interesting', 'totally')]
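To make the difference between the live attrib view and a copied snapshot concrete, here is a small sketch. The variable names live and snapshot are chosen for this illustration only, and the portable import chain from the beginning of the tutorial is assumed.

```python
try:
    from lxml import etree
except ImportError:  # fall back to the standard library
    import xml.etree.ElementTree as etree

root = etree.Element("root", interesting="totally")

live = root.attrib            # backed by the element itself
snapshot = dict(root.attrib)  # independent copy taken right now

root.set("hello", "Huhu")     # change the element afterwards

print(sorted(live.keys()))      # ['hello', 'interesting']
print(sorted(snapshot.keys()))  # ['interesting']
```

The snapshot keeps the state from the moment of copying, while the live view follows every later change to the element.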
Elements contain text
Elements can contain text:
root = etree.Element("root")
root.text = "TEXT"
print(root.text)
# TEXT
etree.tostring(root)
# b'<root>TEXT</root>'
In many XML documents (data-centric documents), this is the only place where text can be found. It is encapsulated by a leaf tag at the bottom of the tree hierarchy.
However, if XML is used to mark up a text document, such as (X)HTML, text can also appear between different elements, right in the middle of the tree:
<html><body>Hello<br/>World</body></html>
Here, the <br/> tag is surrounded by text. This is often referred to as document-style or mixed-content XML. Elements support this through their tail property. It contains the text that directly follows the element, up to the next element in the XML tree:
>>> html = etree.Element("html")
>>> body = etree.SubElement(html, "body")
>>> body.text = "TEXT"
>>> etree.tostring(html)
b'<html><body>TEXT</body></html>'
>>> br = etree.SubElement(body, "br")
>>> etree.tostring(html)
b'<html><body>TEXT<br/></body></html>'
>>> br.tail = "TAIL"
>>> etree.tostring(html)
b'<html><body>TEXT<br/>TAIL</body></html>'
The two properties .text and .tail are enough to represent any text content in an XML document. This way, the ElementTree API does not require any special text nodes in addition to the Element class, which tend to get in the way fairly often (as you may know from classic DOM APIs).
However, there are cases where the tail text also gets in the way. For example, when you serialize an Element from within the tree, you do not always want its tail text in the result (although you would still want the tail text of its children). For this purpose, the tostring() function accepts the keyword argument with_tail:
>>> etree.tostring(br)
b'<br/>TAIL'
>>> etree.tostring(br, with_tail=False)  # lxml.etree only!
b'<br/>'
If you want to read only the text, i.e. without any intermediate tags, you have to recursively concatenate all text and tail attributes in the correct order. Again, the tostring() function comes to the rescue, this time using the method keyword:
>>> etree.tostring(html, method="text")
b'TEXTTAIL'
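For illustration, the manual recursion mentioned above could look roughly like the sketch below. The function name collect_text is made up here, and the sketch assumes a tree that contains only Elements (no comments or processing instructions).

```python
try:
    from lxml import etree
except ImportError:  # fall back to the standard library
    import xml.etree.ElementTree as etree


def collect_text(element):
    """Concatenate .text and .tail values in document order."""
    parts = [element.text or ""]
    for child in element:
        parts.append(collect_text(child))  # text inside the child
        parts.append(child.tail or "")     # text following the child
    return "".join(parts)


html = etree.Element("html")
body = etree.SubElement(html, "body")
body.text = "TEXT"
br = etree.SubElement(body, "br")
br.tail = "TAIL"

print(collect_text(html))  # TEXTTAIL
```

In practice, tostring() with method="text" or the itertext() method of Elements saves you from writing this by hand.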
Using XPath to find text
Another way to extract the text content of a tree is XPath, which also allows you to extract individual blocks of text into a list:
>>> print(html.xpath("string()"))  # lxml.etree only!
TEXTTAIL
>>> print(html.xpath("//text()"))  # lxml.etree only!
['TEXT', 'TAIL']
If you want to use this more often, you can wrap it in a function:
>>> build_text_list = etree.XPath("//text()")  # lxml.etree only!
>>> print(build_text_list(html))
['TEXT', 'TAIL']
Note that a string result returned by XPath is a special "smart" object that knows about its origins. You can ask it where it came from through its getparent() method, just as you would with Elements:
>>> texts = build_text_list(html)
>>> print(texts[0])
TEXT
>>> parent = texts[0].getparent()
>>> print(parent.tag)
body
>>> print(texts[1])
TAIL
>>> print(texts[1].getparent().tag)
br
You can also find out if it is plain text content or trailing text:
>>> print(texts[0].is_text)
True
>>> print(texts[1].is_text)
False
>>> print(texts[1].is_tail)
True
Although this works for the results of the text() function, lxml does not tell you the source of the string value constructed by the XPath functions string() or concat():
>>> stringify = etree.XPath("string()")
>>> print(stringify(html))
TEXTTAIL
>>> print(stringify(html).getparent())
None
Tree iteration
For problems like the one above, where you want to recursively traverse the tree and do something with its elements, tree iteration is a very convenient solution. Elements provide a tree iterator for this purpose. It yields elements in document order, i.e. in the order in which their tags would appear if you serialized the tree to XML:
>>> root = etree.Element("root")
>>> etree.SubElement(root, "child").text = "Child 1"
>>> etree.SubElement(root, "child").text = "Child 2"
>>> etree.SubElement(root, "another").text = "Child 3"
>>> prettyprint(root)
<root>
  <child>Child 1</child>
  <child>Child 2</child>
  <another>Child 3</another>
</root>
>>> for element in root.iter():
...     print(f"{element.tag} - {element.text}")
root - None
child - Child 1
child - Child 2
another - Child 3
If you know you are only interested in a single tag, you can pass its name to iter() to have it filter for you. Starting with lxml 3.0, you can also pass more than one tag to intercept on multiple tags during iteration.
>>> for element in root.iter("child"):
...     print(f"{element.tag} - {element.text}")
child - Child 1
child - Child 2
>>> for element in root.iter("another", "child"):
...     print(f"{element.tag} - {element.text}")
child - Child 1
child - Child 2
another - Child 3
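As a small usage sketch, the tree iterator combines well with ordinary Python tools. The example below counts tags and collects the text of one tag; the names tag_counts and child_texts are made up for this illustration, and only single-tag filtering is used so that it also runs with the standard library ElementTree.

```python
from collections import Counter

try:
    from lxml import etree
except ImportError:  # fall back to the standard library
    import xml.etree.ElementTree as etree

root = etree.Element("root")
etree.SubElement(root, "child").text = "Child 1"
etree.SubElement(root, "child").text = "Child 2"
etree.SubElement(root, "another").text = "Child 3"

# iter() without arguments walks the whole tree in document order
tag_counts = Counter(element.tag for element in root.iter())

# iter("child") lets the iterator do the filtering
child_texts = [element.text for element in root.iter("child")]

print(tag_counts["child"])  # 2
print(child_texts)          # ['Child 1', 'Child 2']
```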
By default, iteration yields all nodes in the tree, including ProcessingInstructions, Comments and Entity instances. If you want to make sure only Element objects are returned, you can pass the Element factory as the tag parameter:
>>> root.append(etree.Entity("#234"))
>>> root.append(etree.Comment("some comment"))
>>> for element in root.iter():
...     if isinstance(element.tag, str):
...         print(f"{element.tag} - {element.text}")
...     else:
...         print(f"SPECIAL: {element} - {element.text}")
root - None
child - Child 1
child - Child 2
another - Child 3
SPECIAL: &#234; - &#234;
SPECIAL: <!--some comment--> - some comment
>>> for element in root.iter(tag=etree.Element):
...     print(f"{element.tag} - {element.text}")
root - None
child - Child 1
child - Child 2
another - Child 3
>>> for element in root.iter(tag=etree.Entity):
...     print(element.text)
&#234;
Note that passing the wildcard "*" as the tag name also yields all Element nodes (and only Elements):
for element in root.iter(tag="*"):
    if isinstance(element.tag, str):
        print(f"{element.tag} - {element.text}")
    else:
        print(f"SPECIAL: {element} - {element.text}")
root - None
child - Child 1
child - Child 2
another - Child 3
In lxml.etree, elements provide further iterators for all directions in the tree: children (iterchildren()), parents, or rather ancestors (iterancestors()), and siblings (itersiblings()).
Serialisation
Serialization commonly uses the tostring() function, which returns a string, or the ElementTree.write() method, which writes to a file, a file-like object, or a URL (via FTP PUT or HTTP POST). Both calls accept the same keyword arguments, such as pretty_print for formatted output or encoding to select a specific output encoding other than plain ASCII:
>>> root = etree.XML('<root><a><b/></a></root>')
>>> etree.tostring(root)
b'<root><a><b/></a></root>'
>>> xml_string = etree.tostring(root, xml_declaration=True)
>>> print(xml_string.decode(), end='')
<?xml version='1.0' encoding='ASCII'?>
<root><a><b/></a></root>
>>> latin1_bytesstring = etree.tostring(root, encoding='iso8859-1')
>>> print(latin1_bytesstring.decode('iso8859-1'), end='')
<?xml version='1.0' encoding='iso8859-1'?>
<root><a><b/></a></root>
>>> print(etree.tostring(root, pretty_print=True).decode(), end='')
<root>
  <a>
    <b/>
  </a>
</root>
Note that pretty printing appends a newline at the end. We therefore use the end='' option to prevent the print() function from adding another line break.
For more fine-grained control over the pretty printing, you can add whitespace indentation to the tree before serializing it, using the indent() function (added in lxml 4.5):
>>> root = etree.XML('<root><a><b/>\n</a></root>')
>>> print(etree.tostring(root).decode())
<root><a><b/>
</a></root>
>>> etree.indent(root)
>>> print(etree.tostring(root).decode())
<root>
  <a>
    <b/>
  </a>
</root>
>>> root.text
'\n  '
>>> root[0].text
'\n    '
>>> etree.indent(root, space="    ")
>>> print(etree.tostring(root).decode())
<root>
    <a>
        <b/>
    </a>
</root>
>>> etree.indent(root, space="\t")
>>> etree.tostring(root)
b'<root>\n\t<a>\n\t\t<b/>\n\t</a>\n</root>'
In lxml 2.0 and later (as well as ElementTree 1.3), the serialization functions can do more than XML serialization. You can serialize to HTML or extract the text content by passing the method keyword:
>>> root = etree.XML(
...    '<html><head/><body><p>Hello<br/>World</p></body></html>')
>>> etree.tostring(root)  # default: method = 'xml'
b'<html><head/><body><p>Hello<br/>World</p></body></html>'
>>> etree.tostring(root, method='xml')  # same as above
b'<html><head/><body><p>Hello<br/>World</p></body></html>'
>>> etree.tostring(root, method='html')
b'<html><head></head><body><p>Hello<br>World</p></body></html>'
>>> prettyprint(root, method='html')
<html>
<head></head>
<body><p>Hello<br>World</p></body>
</html>
>>> etree.tostring(root, method='text')
b'HelloWorld'
As with XML serialization, the default encoding for plain text serialization is ASCII:
>>> br = next(root.iter('br'))  # get first result of iteration
>>> br.tail = 'Wörld'
>>> etree.tostring(root, method='text')  # doctest: +ELLIPSIS
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' ...
>>> etree.tostring(root, method='text', encoding="UTF-8")
b'HelloW\xc3\xb6rld'
Here, serializing to a Python text string instead of a byte string can be handy. Simply pass 'unicode' as the encoding:
>>> etree.tostring(root, encoding='unicode', method='text')
'HelloWörld'
>>> etree.tostring(root, encoding='unicode')
'<html><head/><body><p>Hello<br/>Wörld</p></body></html>'
The W3C has a good article on Unicode character sets and character encodings:https:///International/tutorials/tutorial-char-enc/
The ElementTree class
An ElementTree is mainly a document wrapper around a tree with a root node. It provides a number of methods for serialization and general document handling.
root = etree.XML('''<?xml version="1.0"?>
<!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "parsnips"> ]>
<root>
  <a>&tasty;</a>
</root>
''')
tree = etree.ElementTree(root)
print(tree.docinfo.xml_version)
1.0
print(tree.docinfo.doctype)
<!DOCTYPE root SYSTEM "test">
tree.docinfo.public_id = '-//W3C//DTD XHTML 1.0 Transitional//EN'
tree.docinfo.system_url = 'file://'
print(tree.docinfo.doctype)
<!DOCTYPE root PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "file://">
An ElementTree is also what you get back when you call the parse() function to parse a file or file-like object (see the parsing section below).
One important difference is that the ElementTree class serializes as a complete document, as opposed to a single Element. This includes top-level processing instructions and comments, as well as a DOCTYPE and other DTD content in the document:
>>> prettyprint(tree)  # lxml 1.3.4 and later
<!DOCTYPE root PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "file://" [
<!ENTITY tasty "parsnips">
]>
<root>
  <a>parsnips</a>
</root>
In the original xml.etree.ElementTree implementation and in lxml up to 1.3.3, the output looks the same as when serializing only the root Element:
>>> prettyprint(tree.getroot())
<root>
  <a>parsnips</a>
</root>
This serialization behavior changed in lxml 1.3.4. Previously, trees were serialized without DTD content, which caused lxml to lose DTD information in input-output loops.
Parsing from strings and files
lxml.etree supports parsing XML in a number of ways and from all important sources, namely strings, files, URLs (http/ftp) and file-like objects. The main parse functions are fromstring() and parse(), both called with the source as the first argument. By default, they use the standard parser, but you can always pass a different parser as the second argument.
The fromstring() function
The fromstring() function is the easiest way to parse a string:
>>> some_xml_data = "<root>data</root>"
>>> root = etree.fromstring(some_xml_data)
>>> print(root.tag)
root
>>> etree.tostring(root)
b'<root>data</root>'
print(type(root))
# <class 'lxml.etree._Element'>
The XML() function
The XML() function behaves like the fromstring() function, but is commonly used to write XML literals right into the source code:
>>> root = etree.XML("<root>data</root>")
>>> print(root.tag)
root
>>> etree.tostring(root)
b'<root>data</root>'
print(type(root))
# <class 'lxml.etree._Element'>
There is also a corresponding function HTML() for HTML literals.
>>> root = etree.HTML("<p>data</p>")
>>> etree.tostring(root)
b'<html><body><p>data</p></body></html>'
print(type(root))
# <class 'lxml.etree._Element'>
The parse() function
The parse() function is used to parse from files and file-like objects.
As an example of such a file-like object, the following code uses the BytesIO class to read from a string instead of an external file. In real life, however, you would obviously avoid doing this and use a string parsing function like fromstring() above.
>>> from io import BytesIO
>>> some_file_or_file_like_object = BytesIO(b"<root>data</root>")
>>> tree = etree.parse(some_file_or_file_like_object)
>>> etree.tostring(tree)
b'<root>data</root>'
Note that parse() returns an ElementTree object, not an Element object as the string parser functions do:
print(type(tree))
# <class 'lxml.etree._ElementTree'>
>>> root = tree.getroot()
>>> print(root.tag)
root
>>> etree.tostring(root)
b'<root>data</root>'
The reason behind this difference is that parse() returns a complete document from a file, whereas string parsing functions are typically used to parse XML fragments.
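The different return types can be hidden behind a small wrapper when a caller only ever wants the root Element. This is just a sketch: the name load_root is hypothetical, and it assumes that any str or bytes argument is XML data (not a filename), while everything else is treated as a file-like object.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:  # fall back to the standard library
    import xml.etree.ElementTree as etree


def load_root(source):
    """Return the root Element for XML data or a file-like object."""
    if isinstance(source, (str, bytes)):
        # string parsing functions return an Element directly
        return etree.fromstring(source)
    # parse() returns an ElementTree, so unwrap its root
    return etree.parse(source).getroot()


print(load_root(b"<root>data</root>").tag)            # root
print(load_root(BytesIO(b"<root>data</root>")).text)  # data
```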
The parse() function supports any of the following sources:
- An open file object (make sure to open it in binary mode)
- A file-like object that has a .read(byte_count) method returning a byte string on each call
- A filename string
- An HTTP or FTP URL string
Note that passing a filename or URL is usually faster than passing an open file or file-like object. However, the HTTP/FTP client in libxml2 is rather simple, so things like HTTP authentication require a dedicated URL request library, e.g. urllib2 or requests. These libraries usually provide a file-like object for the result, which you can parse from while the response is streaming in.
Parser objects
By default, a standard parser with default settings is used. If you want to configure the parser, you can create a new instance:
parser = etree.XMLParser(remove_blank_text=True)  # lxml.etree only!
This creates a parser that removes whitespace-only text between tags while parsing, which can reduce the size of the tree and avoid dangling tail text if you know that whitespace-only content is not meaningful for your data. An example:
>>> root = etree.XML("<root> <a/> <b> </b> </root>", parser)
>>> etree.tostring(root)
b'<root><a/><b> </b></root>'
Note that the whitespace content inside the <b> tag was not removed, because content at leaf elements tends to be data content (even if it is blank). You can easily remove it in an additional step by traversing the tree:
for element in root.iter("*"):
    if element.text is not None and not element.text.strip():
        element.text = None
etree.tostring(root)
b'<root><a/><b/></root>'
See help(etree.XMLParser) to find out about the available parser options.
Incremental parsing
lxml.etree provides two ways of incremental, step-by-step parsing. One is through file-like objects, where it calls the read() method repeatedly. This is best used where the data arrives from a source like urllib or any other file-like object that can provide data on request. Note that the parser will block and wait until data becomes available in this case:
class DataSource:
    data = [ b"<roo", b"t><", b"a/", b"><", b"/root>" ]
    def read(self, requested_size):
        try:
            return self.data.pop(0)
        except IndexError:
            return b''
tree = etree.parse(DataSource())
etree.tostring(tree)
b'<root><a/></root>'
The second way is through the feed(data) and close() methods that the parser provides:
parser = etree.XMLParser()
parser.feed("<roo")
parser.feed("t><")
parser.feed("a/")
parser.feed("><")
parser.feed("/root>")
root = parser.close()
etree.tostring(root)
b'<root><a/></root>'
Here, you can interrupt the parsing process at any time and continue it later on with another call to the feed() method. This comes in handy if you want to avoid blocking calls to the parser, e.g. in frameworks like Twisted, or whenever data comes in slowly or in chunks and you want to do other things while waiting for the next chunk.
After calling the close() method (or if the parser raises an exception), you can reuse the parser by calling its feed() method again:
parser.feed("<root/>")
root = parser.close()
etree.tostring(root)
b'<root/>'
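The feed interface is easy to wrap for data that arrives in pieces. The sketch below uses a made-up helper name, parse_chunks, and creates a fresh parser per document so that it behaves the same with lxml.etree and with the standard library.

```python
try:
    from lxml import etree
except ImportError:  # fall back to the standard library
    import xml.etree.ElementTree as etree


def parse_chunks(chunks):
    """Feed an iterable of byte chunks to a parser and return the root."""
    parser = etree.XMLParser()
    for chunk in chunks:
        parser.feed(chunk)  # other work could happen between the calls
    return parser.close()   # finishes parsing and returns the root Element


root = parse_chunks([b"<roo", b"t><", b"a/", b"><", b"/root>"])
print(root.tag, len(root))  # root 1
```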
Event-driven parsing
Sometimes, all you need from a document is a small fraction somewhere deep inside the tree, so parsing the whole tree into memory, traversing it and then discarding it can be too much overhead. lxml.etree supports this use case with two event-driven parser interfaces: one that generates parser events while building the tree (iterparse), and one that does not build the tree at all, but instead calls feedback methods on a target object in a SAX-like fashion.
Here is a simple iterparse() example:
some_file_like = BytesIO(b"<root><a>data</a></root>")
for event, element in etree.iterparse(some_file_like):
    print(f"{event}, {element.tag:>4}, {element.text}")
end,    a, data
end, root, None
By default, iterparse() only generates an event when it has finished parsing an element, but you can control this with the events keyword argument:
some_file_like = BytesIO(b"<root><a>data</a></root>")
for event, element in etree.iterparse(some_file_like,
                                      events=("start", "end")):
    print(f"{event:>5}, {element.tag:>4}, {element.text}")
start, root, None
start,    a, data
  end,    a, data
  end, root, None
Note that the text, tail and children of an Element are not necessarily present yet when the start event is received. Only the end event guarantees that the Element has been parsed completely.
It also allows you to .clear() or modify the content of an Element to save memory. So if you parse a large tree and want to keep memory usage small, you should clean up parts of the tree that you no longer need. The keep_tail=True argument to .clear() makes sure that the (tail) text content that follows the current element is not touched. It is highly discouraged to modify any content that the parser may not have completely read through yet.
some_file_like = BytesIO(b"<root><a><b>data</b></a><a><b/></a></root>")
for event, element in etree.iterparse(some_file_like):
    if element.tag == 'b':
        print(element.text)
    elif element.tag == 'a':
        print("** cleaning up the subtree")
        element.clear(keep_tail=True)
data
** cleaning up the subtree
None
** cleaning up the subtree
A very important use case for iterparse() is parsing large generated XML files, e.g. database dumps. Most often, these XML formats have just one main data item element that hangs directly below the root node and is repeated thousands of times. In this case, it is best practice to let lxml.etree do the tree building and only intercept on exactly this one Element, using the normal tree API for data extraction.
xml_file = BytesIO(b'''
<root>
  <a><b>ABC</b><c>abc</c></a>
  <a><b>MORE DATA</b><c>more data</c></a>
  <a><b>XYZ</b><c>xyz</c></a>
</root>''')
for _, element in etree.iterparse(xml_file, tag='a'):
    print('%s -- %s' % (element.findtext('b'), element[1].text))
    element.clear(keep_tail=True)
ABC -- abc
MORE DATA -- more data
XYZ -- xyz
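If the code also has to run with the standard library ElementTree, the same pattern can be written without the lxml-specific tag= and keep_tail= arguments. The following is a sketch under that assumption: it filters on the tag by hand and uses a plain clear(), which also drops the tail text of the cleared element. The variable name rows is made up for this illustration.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:  # fall back to the standard library
    import xml.etree.ElementTree as etree

xml_file = BytesIO(
    b"<root>"
    b"<a><b>ABC</b><c>abc</c></a>"
    b"<a><b>XYZ</b><c>xyz</c></a>"
    b"</root>"
)

rows = []
for event, element in etree.iterparse(xml_file, events=("end",)):
    if element.tag == "a":  # manual filter instead of tag='a'
        rows.append((element.findtext("b"), element.findtext("c")))
        element.clear()     # release the finished subtree

print(rows)  # [('ABC', 'abc'), ('XYZ', 'xyz')]
```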
If, for some reason, you don't want to build a tree at all, you can use the target parser interface. It creates SAX-like events by calling methods on the target object. By implementing some or all of these methods, you can control which events are generated:
class ParserTarget:
    events = []
    close_count = 0
    def start(self, tag, attrib):
        self.events.append(('start', tag, attrib))
    def close(self):
        events, self.events = self.events, []
        self.close_count += 1
        return events
parser_target = ParserTarget()
parser = etree.XMLParser(target=parser_target)
events = etree.fromstring('<root test="true"/>', parser)
print(parser_target.close_count)
1
for event in events:
    print(f'event: {event[0]} - tag: {event[1]}')
    for attr, value in event[2].items():
        print(f' * {attr} = {value}')
event: start - tag: root
 * test = true
You can reuse the parser and its target as often as you like, so you should make sure that the .close() method really resets the target to a usable state (also in the case of an error!).
>>> events = etree.fromstring('<root test="true"/>', parser)
>>> print(parser_target.close_count)
2
>>> events = etree.fromstring('<root test="true"/>', parser)
>>> print(parser_target.close_count)
3
>>> events = etree.fromstring('<root test="true"/>', parser)
>>> print(parser_target.close_count)
4
>>> for event in events:
...     print(f'event: {event[0]} - tag: {event[1]}')
...     for attr, value in event[2].items():
...         print(f' * {attr} = {value}')
event: start - tag: root
 * test = true
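As a further illustration of the target interface, a target can compute a result without keeping any elements around. The class below is a hypothetical example (the name DepthTarget is made up here) that tracks the maximum nesting depth of a document and resets itself in close() so that it can be reused.

```python
try:
    from lxml import etree
except ImportError:  # fall back to the standard library
    import xml.etree.ElementTree as etree


class DepthTarget:
    """Parser target that records the deepest nesting level seen."""

    def __init__(self):
        self.depth = 0
        self.max_depth = 0

    def start(self, tag, attrib):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def end(self, tag):
        self.depth -= 1

    def close(self):
        # hand back the result and reset the state for the next document
        result, self.depth, self.max_depth = self.max_depth, 0, 0
        return result


parser = etree.XMLParser(target=DepthTarget())
max_depth = etree.fromstring("<a><b><c/></b><b/></a>", parser)
print(max_depth)  # 3
```

With a target parser, the parse function returns whatever the target's close() method returns, here an integer instead of a tree.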
Namespaces
The ElementTree API avoids namespace prefixes wherever possible and uses the real namespace (the URI) instead:
>>> xhtml = etree.Element("{http:///1999/xhtml}html")
>>> body = etree.SubElement(xhtml, "{http:///1999/xhtml}body")
>>> body.text = "Hello World"
>>> prettyprint(xhtml)
<html:html xmlns:html="http:///1999/xhtml">
  <html:body>Hello World</html:body>
</html:html>
The notation that ElementTree uses was originally proposed by James Clark. Its main advantage is that it provides a universally qualified name for a tag, regardless of any prefixes that may or may not have been used or defined in a document. By removing the indirection of prefixes, it makes namespace-aware code much clearer and easier to get right.
As you can see from the example, prefixes only become important when you serialize the result. However, the above code looks somewhat verbose due to the long namespace names. Moreover, retyping or copying a string over and over again is error-prone. It is therefore common practice to store a namespace URI in a global variable. To adapt the namespace prefixes used for serialization, you can also pass a mapping to the Element factory function, e.g. to define the default namespace:
>>> XHTML_NAMESPACE = "http:///1999/xhtml"
>>> XHTML = "{%s}" % XHTML_NAMESPACE
>>> NSMAP = {None : XHTML_NAMESPACE}  # the default namespace (no prefix)
>>> xhtml = etree.Element(XHTML + "html", nsmap=NSMAP)  # lxml only!
>>> body = etree.SubElement(xhtml, XHTML + "body")
>>> body.text = "Hello World"
>>> prettyprint(xhtml)
<html xmlns="http:///1999/xhtml">
  <body>Hello World</body>
</html>
You can also use the QName helper class to build or split qualified tag names:
>>> tag = etree.QName('http:///1999/xhtml', 'html')
>>> print(tag.localname)
html
>>> print(tag.namespace)
http:///1999/xhtml
>>> print(tag.text)
{http:///1999/xhtml}html
>>> tag = etree.QName('{http:///1999/xhtml}html')
>>> print(tag.localname)
html
>>> print(tag.namespace)
http:///1999/xhtml
>>> root = etree.Element('{http:///1999/xhtml}html')
>>> tag = etree.QName(root)
>>> print(tag.localname)
html
>>> tag = etree.QName(root, 'script')
>>> print(tag.text)
{http:///1999/xhtml}script
>>> tag = etree.QName('{http:///1999/xhtml}html', 'script')
>>> print(tag.text)
{http:///1999/xhtml}script
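To show what the {namespace}tag notation boils down to, here is a sketch of two plain string helpers that build and split qualified names. The function names qualify and split_qualified and the namespace URI are placeholders invented for this illustration; with lxml, the QName class already does this for you.

```python
try:
    from lxml import etree
except ImportError:  # fall back to the standard library
    import xml.etree.ElementTree as etree

NS = "http://example.com/ns"  # placeholder namespace URI


def qualify(namespace, localname):
    """Build a qualified tag name in {namespace}localname notation."""
    return "{%s}%s" % (namespace, localname)


def split_qualified(tag):
    """Split a tag name into (namespace, localname)."""
    if tag.startswith("{"):
        namespace, localname = tag[1:].split("}", 1)
        return namespace, localname
    return None, tag  # tag without a namespace


element = etree.Element(qualify(NS, "html"))
print(split_qualified(element.tag))  # ('http://example.com/ns', 'html')
print(split_qualified("plain"))      # (None, 'plain')
```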
lxml.etree allows you to look up the current namespaces defined for a node through the .nsmap property:
>>> xhtml.nsmap
{None: 'http:///1999/xhtml'}
Note that this includes all prefixes known in the context of an Element, not only those that it defines itself.
root = etree.Element('root', nsmap={'a': '/c'})
child = etree.SubElement(root, 'child',
                         nsmap={'b': '/d'})
print(root.nsmap)
{'a': '/c'}
len(root.nsmap)
# 1
print(child.nsmap)
{'b': '/d', 'a': '/c'}
len(child.nsmap)
# 2
child.nsmap['a']
# '/c'
child.nsmap['b']
# '/d'
Therefore, modifying the returned dictionary has no meaningful effect on the Element. Any changes to it are ignored.
Namespaces on attributes work in a similar way, but as of version 2.3, lxml.etree will ensure that the attribute uses a prefixed namespace declaration. This is because the XML namespace specification (Section 6.2) does not consider unprefixed attribute names to be in any namespace, so they may end up losing their namespace on a serialize-parse roundtrip, even if they appear in a namespaced element.
body.set(XHTML + "bgcolor", "#CCFFAA")
prettyprint(xhtml)
<html xmlns="http:///1999/xhtml">
  <body xmlns:html="http:///1999/xhtml" html:bgcolor="#CCFFAA">Hello World</body>
</html>
# The XML namespace specification considers unprefixed attribute names not to be in any namespace, so this returns None
print(body.get("bgcolor"))
None
# Use the fully qualified attribute name
body.get(XHTML + "bgcolor")
'#CCFFAA'
You can also use XPath with fully qualified names:
# Review the xhtml first
print(etree.tostring(xhtml).decode())
<html xmlns="http:///1999/xhtml"><body xmlns:html="http:///1999/xhtml" html:bgcolor="#CCFFAA">Hello World</body></html>
>>> find_xhtml_body = etree.ETXPath(  # lxml only!
...     "//{%s}body" % XHTML_NAMESPACE)
>>> results = find_xhtml_body(xhtml)
>>> print(results[0].tag)
{http:///1999/xhtml}body
For convenience, you can use "*" wildcards in the iterators of lxml.etree, both for tag names and for namespaces:
>>> for el in xhtml.iter('*'): print(el.tag)  # any element
{http:///1999/xhtml}html
{http:///1999/xhtml}body
>>> for el in xhtml.iter('{http:///1999/xhtml}*'): print(el.tag)
{http:///1999/xhtml}html
{http:///1999/xhtml}body
>>> for el in xhtml.iter('{*}body'): print(el.tag)
{http:///1999/xhtml}body
To look for elements that do not have a namespace, either use the plain tag name or provide the empty namespace explicitly:
>>> [ el.tag for el in xhtml.iter('{http:///1999/xhtml}body') ]
['{http:///1999/xhtml}body']
>>> [ el.tag for el in xhtml.iter('body') ]
[]
>>> [ el.tag for el in xhtml.iter('{}body') ]
[]
>>> [ el.tag for el in xhtml.iter('{}*') ]
[]
The E-factory
E-factory provides a simple and compact syntax for generating XML and HTML:
from lxml.builder import E
def CLASS(*args):  # class is a reserved word in Python
    return {"class": ' '.join(args)}
html = page = (
  E.html(
    E.head(
      E.title("This is a sample document")
    ),
    E.body(
      E.h1("Hello!", CLASS("title")),
      E.p("This is a paragraph with ", E.b("bold"), " text in it!"),
      E.p("This is another paragraph, with a", "\n      ",
        E.a("link", href=""), "."),
      E.p("Here are some reserved characters: <spam&egg>."),
      etree.XML("<p>And finally an embedded XHTML fragment.</p>"),
    )
  )
)
prettyprint(page)
<html>
  <head>
    <title>This is a sample document</title>
  </head>
  <body>
    <h1 class="title">Hello!</h1>
    <p>This is a paragraph with <b>bold</b> text in it!</p>
    <p>This is another paragraph, with a
      <a href="">link</a>.</p>
    <p>Here are some reserved characters: &lt;spam&amp;egg&gt;.</p>
    <p>And finally an embedded XHTML fragment.</p>
  </body>
</html>
Attribute access based element creation makes it easy to build a simple vocabulary for the XML language.
from lxml.builder import ElementMaker  # lxml only!
E = ElementMaker(namespace="/fault/namespace", nsmap={'p': "/fault/namespace"})
DOC = E.doc
TITLE = E.title
SECTION = E.section
PAR = E.par
my_doc = DOC(
  TITLE("The dog and the hog"),
  SECTION(
    TITLE("The dog"),
    PAR("Once upon a time, ..."),
    PAR("And then ...")
  ),
  SECTION(
    TITLE("The hog"),
    PAR("Sooner or later ...")
  )
)
prettyprint(my_doc)
<p:doc xmlns:p="/fault/namespace">
  <p:title>The dog and the hog</p:title>
  <p:section>
    <p:title>The dog</p:title>
    <p:par>Once upon a time, ...</p:par>
    <p:par>And then ...</p:par>
  </p:section>
  <p:section>
    <p:title>The hog</p:title>
    <p:par>Sooner or later ...</p:par>
  </p:section>
</p:doc>
One such example is the module lxml.html.builder, which provides a vocabulary for HTML.
When dealing with multiple namespaces, it is good practice to define one ElementMaker for each namespace URI. Again, note how the above example predefines the tag builders in named constants. That makes it easy to put all tag declarations of a namespace into one Python module and to import and use the tag name constants from there. This avoids pitfalls like typos or accidentally missing namespaces.
ElementPath
The ElementTree library comes with a simple XPath-like path language called ElementPath. The main difference is that you can use the {namespace}tag notation in ElementPath expressions. However, advanced features like value comparison and functions are not available.
In addition to a full XPath implementation, lxml.etree supports the ElementPath language in the same way ElementTree does, even using (almost) the same implementation. The API provides four methods here that you can find on Elements and ElementTrees:
- iterfind() iterates over all Elements that match the path expression
- findall() returns a list of matching Elements
- find() efficiently returns only the first match
- findtext() returns the .text content of the first match
Here are some examples:
root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")
Find a child of an Element:
>>> print(root.find("b"))
None
>>> print(root.find("a").tag)
a
Find an Element anywhere in the tree:
>>> print(root.find(".//b").tag)
b
>>> [ b.tag for b in root.iterfind(".//b") ]
['b', 'b']
Find Elements with a certain attribute:
>>> print(root.findall(".//a[@x]")[0].tag)
a
>>> print(root.findall(".//a[@y]"))
[]
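For a side-by-side view of the four methods, here is a short sketch that applies each of them to the example document from above. The variable names are made up for this illustration, and the default argument of findtext() is used to supply a fallback value for a missing element.

```python
try:
    from lxml import etree
except ImportError:  # fall back to the standard library
    import xml.etree.ElementTree as etree

root = etree.fromstring("<root><a x='123'>aText<b/><c/><b/></a></root>")

first_b = root.find(".//b")                         # first match or None
all_b = root.findall(".//b")                        # list of all matches
a_text = root.findtext("a")                         # .text of the first match
missing = root.findtext("nope", default="n/a")      # fallback if nothing matches
with_x = [el.tag for el in root.iterfind(".//a[@x]")]  # lazy iteration

print(first_b.tag, len(all_b), a_text, missing, with_x)
```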
In lxml 3.4, there is a new helper to generate a structural ElementPath expression for an Element:
>>> tree = etree.ElementTree(root)
>>> a = root[0]
>>> print(tree.getelementpath(a[0]))
a/b[1]
>>> print(tree.getelementpath(a[1]))
a/c
>>> print(tree.getelementpath(a[2]))
a/b[2]
>>> tree.find(tree.getelementpath(a[2])) == a[2]
True
As long as the tree is not modified, this path expression represents an identifier for a given element that can be used to find() it in the same tree later. Compared to XPath, ElementPath expressions have the advantage of being self-contained even for documents that use namespaces.
The .iter() method is a special case that only finds specific tags in the tree by their name, not based on a path. That means that the following commands are all equivalent in the success case:
>>> print(root.find(".//b").tag)
b
>>> print(next(root.iterfind(".//b")).tag)
b
>>> print(next(root.iter("b")).tag)
b