How to Read Xml Response in Python
Processing XML in Python — ElementTree
A Beginner's Guide
Learn how you can parse, explore, modify and populate XML files with the Python ElementTree package, for loops and XPath expressions. As a data scientist, you'll find that understanding XML is powerful for both web-scraping and general practice in parsing a structured document.
Extensible Markup Language (XML) is a markup language which encodes documents by defining a set of rules in both machine-readable and human-readable format. Extended from SGML (Standard Generalized Markup Language), it lets us describe the structure of the document. In XML, we can define custom tags. We can also use XML as a standard format to exchange data.
- XML documents have sections, called elements, defined by a beginning and an ending tag. A tag is a markup construct that begins with < and ends with >. The characters between the start-tag and end-tag, if there are any, are the element's content. Elements can contain markup, including other elements, which are called "child elements".
- The largest, top-level element is called the root, which contains all other elements.
- Attributes are name–value pairs that exist within a start-tag or empty-element tag. An XML attribute can only have a single value and each attribute can appear at most once on each element.
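To make these terms concrete, here is a minimal sketch that parses a tiny XML string and looks at the root, a child element, an attribute dictionary, and an element's content. The one-movie document in xml_text is a hypothetical sample written for this illustration, not the tutorial's movies.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-movie document, used only to illustrate the terms above.
xml_text = """<collection>
    <movie favorite="True" title="Batman Returns">
        <year>1992</year>
    </movie>
</collection>"""

root = ET.fromstring(xml_text)  # the root element: <collection>
movie = root[0]                 # first child element of the root
year = movie[0]                 # first child element of <movie>

print(root.tag)      # tag name of the root element
print(movie.attrib)  # attributes as a dict of name-value pairs
print(year.text)     # the element's content (the text between the tags)
```

Elements behave like lists of their children, which is why root[0] returns the first child.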
Here's a snapshot of movies.xml that we will be using for this tutorial:
<?xml version="1.0"?>
<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
                <format multiple="No">DVD</format>
                <year>1981</year>
                <rating>PG</rating>
                <description>
                'Archaeologist and charlatan Indiana Jones
                is hired by the U.S. government to find the Ark of the Covenant before the Nazis.'
                </description>
            </movie>
            <movie favorite="True" title="THE KARATE KID">
                <format multiple="Yes">DVD,Online</format>
                <year>1984</year>
                <rating>PG</rating>
                <description>None provided.</description>
            </movie>
            <movie favorite="False" title="Back 2 the Future">
                <format multiple="False">Blu-ray</format>
                <year>1985</year>
                <rating>PG</rating>
                <description>Marty McFly</description>
            </movie>
        </decade>
        <decade years="1990s">
            <movie favorite="False" title="X-Men">
                <format multiple="Yes">dvd, digital</format>
                <year>2000</year>
                <rating>PG-13</rating>
                <description>Two mutants come to a private academy for their kind whose resident superhero team must oppose a terrorist organization with similar powers.</description>
            </movie>
            <movie favorite="True" title="Batman Returns">
                <format multiple="No">VHS</format>
                <year>1992</year>
                <rating>PG13</rating>
                <description>NA.</description>
            </movie>
            <movie favorite="False" title="Reservoir Dogs">
                <format multiple="No">Online</format>
                <year>1992</year>
                <rating>R</rating>
                <description>Any I Want!!!?!</description>
            </movie>
        </decade>
    </genre>
    <genre category="Thriller">
        <decade years="1970s">
            <movie favorite="False" title="ALIEN">
                <format multiple="Yes">DVD</format>
                <year>1979</year>
                <rating>R</rating>
                <description>"""""""""</description>
            </movie>
        </decade>
        <decade years="1980s">
            <movie favorite="True" title="Ferris Bueller's Day Off">
                <format multiple="No">DVD</format>
                <year>1986</year>
                <rating>PG13</rating>
                <description>Funny movie about a funny guy</description>
            </movie>
            <movie favorite="FALSE" title="American Psycho">
                <format multiple="No">blue-ray</format>
                <year>2000</year>
                <rating>Unrated</rating>
                <description>psychopathic Bateman</description>
            </movie>
        </decade>
    </genre>
Introduction to ElementTree
The XML tree structure makes navigation, modification, and removal relatively simple programmatically. Python has a built-in library, ElementTree, that has functions to read and manipulate XMLs (and other similarly structured files).
First, import ElementTree. It's a common practice to use the alias of ET:
import xml.etree.ElementTree as ET
Parsing XML Data
In the XML file provided, there is a basic collection of movies described. The only problem is the data is a mess! There have been a lot of different curators of this collection and everybody has their own way of entering data into the file. The main goal in this tutorial will be to read and understand the file with Python, and then fix the problems.
First you need to read in the file with ElementTree.
tree = ET.parse('movies.xml')
root = tree.getroot()
Now that you have initialized the tree, you should look at the XML and print out values in order to understand how the tree is structured.
root.tag

'collection'
At the top level, you see that this XML is rooted in the collection tag.
root.attrib

{}
The empty dictionary shows that the root has no attributes.
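ET.parse() reads from a file, but XML often arrives as a string or as bytes, for example as the body of an HTTP response. The sketch below assumes such a payload is already in memory; response_body is a hypothetical stand-in, not data from this tutorial. ET.fromstring() parses it and returns the root element directly, so there is no getroot() step.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body; in practice this might come from an HTTP client.
response_body = b'<?xml version="1.0"?><collection><genre category="Action"/></collection>'

# fromstring() returns the root Element itself, not an ElementTree object.
root = ET.fromstring(response_body)

print(root.tag, root.attrib)
print([child.attrib for child in root])
```

From here on, root can be explored exactly like the root obtained from tree.getroot().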
For Loops
You can easily iterate over subelements (commonly called "children") in the root by using a simple "for" loop.
for child in root:
    print(child.tag, child.attrib)

genre {'category': 'Action'}
genre {'category': 'Thriller'}
genre {'category': 'Comedy'}
Now you know that the children of the root collection are all genre. To designate the genre, the XML uses the attribute category. There are Action, Thriller, and Comedy movies according to the genre element.
Typically it is helpful to know all the elements in the entire tree. One useful function for doing that is root.iter().
[elem.tag for elem in root.iter()]

['collection',
 'genre',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 .
 .
 .
 .
 'movie',
 'format',
 'year',
 'rating',
 'description']
There is a helpful way to see the whole document. If you pass the root into the .tostring() method, you can return the whole document. Within ElementTree, this method takes a slightly strange form.
Since ElementTree is a powerful library that can interpret more than just XML, you must specify both the encoding and decoding of the document you are displaying as the string.
print(ET.tostring(root, encoding='utf8').decode('utf8'))
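The encode-then-decode step is easier to follow on a small tree. The sketch below uses a hypothetical two-element document rather than movies.xml. With a byte encoding such as 'utf8', ET.tostring() returns bytes, which is why .decode() is needed before printing; passing encoding='unicode' is an alternative that returns a str directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal tree, standing in for the full movies.xml root.
root = ET.fromstring('<collection><genre category="Action"/></collection>')

# A byte encoding yields bytes, so decode them to get printable text.
as_bytes = ET.tostring(root, encoding='utf8')
print(as_bytes.decode('utf8'))

# encoding='unicode' yields a str directly.
as_text = ET.tostring(root, encoding='unicode')
print(as_text)
```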
You can expand the use of the iter() function to help with finding particular elements of interest. root.iter() will list all subelements under the root that match the element specified. Here, you will list all attributes of the movie element in the tree:
for movie in root.iter('movie'):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'Back 2 the Future'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
{'favorite': 'FALSE', 'title': 'American Psycho'}
{'favorite': 'False', 'title': 'Batman: The Movie'}
{'favorite': 'True', 'title': 'Easy A'}
{'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
{'favorite': 'False', 'title': 'Ghostbusters'}
{'favorite': 'True', 'title': 'Robin Hood: Prince of Thieves'}
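The difference between iter() with and without a tag name can be checked on a small tree. This sketch assumes a hypothetical two-movie document: iter() walks every element in document order, starting with the root itself, while iter('movie') yields only the elements with that tag, at any depth.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-movie document for illustration.
xml_text = """<collection>
    <genre category="Action">
        <decade years="1990s">
            <movie favorite="True" title="Batman Returns"><year>1992</year></movie>
            <movie favorite="False" title="Reservoir Dogs"><year>1992</year></movie>
        </decade>
    </genre>
</collection>"""
root = ET.fromstring(xml_text)

# No argument: every element in the tree, in document order.
all_tags = [elem.tag for elem in root.iter()]

# With a tag name: only matching elements, however deeply nested.
titles = [movie.get('title') for movie in root.iter('movie')]

print(all_tags)
print(titles)
```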
XPath Expressions
Many times elements will not have attributes; they will only have text content. Using the attribute .text, you can print out this content.
Now, print out all the descriptions of the movies.
for description in root.iter('description'):
    print(description.text)

'Archaeologist and charlatan Indiana Jones
is hired by the U.S. government to find the Ark of the Covenant before the Nazis.'
None provided.
Marty McFly
Two mutants come to a private academy for their kind whose resident superhero team must oppose a terrorist organization with similar powers.
NA.
Any I Want!!!?!
"""""""""
Funny movie about a funny guy
psychopathic Bateman
What a joke!
Emma Stone = Hester Prynne
Tim (Rudd) is a rising executive who "succeeds" in finding the perfect guest, IRS employee Barry (Carell), for his boss' monthly event, a so-called "dinner for idiots," which offers certain
advantages to the exec who shows up with the biggest buffoon.
Who ya gonna call?
Robin Hood slaying
Printing out the XML is helpful, but XPath is a query language used to search through an XML quickly and easily. Understanding XPath is critically important to scanning and populating XMLs. ElementTree supports a subset of XPath through its .findall() function. Given a bare tag name, .findall() matches only the immediate children of the referenced element; a path expression lets it reach deeper levels of the tree.
Here, you will search the tree for movies that came out in 1992:
for movie in root.findall("./genre/decade/movie[year='1992']"):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
The function .findall() always begins at the element specified. This type of function is extremely powerful for a "find and replace". You can even search on attributes!
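Here is a minimal sketch of the two kinds of predicate, run against a hypothetical three-movie document rather than movies.xml. [year='1992'] filters on the text of a child element, [@favorite='True'] filters on an attribute, and a bare tag name shows that the search starts at the element .findall() is called on.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-movie document for illustration.
xml_text = """<collection>
    <genre category="Action">
        <decade years="1990s">
            <movie favorite="False" title="X-Men"><year>2000</year></movie>
            <movie favorite="True" title="Batman Returns"><year>1992</year></movie>
            <movie favorite="False" title="Reservoir Dogs"><year>1992</year></movie>
        </decade>
    </genre>
</collection>"""
root = ET.fromstring(xml_text)

# Predicate on a child element's text.
from_1992 = [m.get('title') for m in root.findall("./genre/decade/movie[year='1992']")]

# Predicate on an attribute; './/' searches all levels below the root.
favorites = [m.get('title') for m in root.findall(".//movie[@favorite='True']")]

# A bare tag name only matches direct children of the starting element.
direct_movies = root.findall("movie")

print(from_1992)
print(favorites)
print(direct_movies)
```

The last search comes back empty because the movie elements sit three levels below the root, not directly under it.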
Now, print out just the movies that are available in multiple formats (an attribute).
for movie in root.findall("./genre/decade/movie/format[@multiple='Yes']"):
    print(movie.attrib)

{'multiple': 'Yes'}
{'multiple': 'Yes'}
{'multiple': 'Yes'}
{'multiple': 'Yes'}
{'multiple': 'Yes'}
Brainstorm why, in this case, the print statement returns the "Yes" values of multiple. Think about how the "for" loop is defined.
Tip: use '..' inside of XPath to return the parent element of the current element.
for movie in root.findall("./genre/decade/movie/format[@multiple='Yes']/.."):
    print(movie.attrib)

{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'False', 'title': 'Batman: The Movie'}
{'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
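The two searches can be compared side by side on a small tree. This sketch assumes a hypothetical two-movie document. The first path ends at format, so it returns format elements; adding the parent step '..' moves back up to the movie that contains each match.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-movie document for illustration.
xml_text = """<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="THE KARATE KID">
                <format multiple="Yes">DVD,Online</format>
            </movie>
            <movie favorite="False" title="Back 2 the Future">
                <format multiple="False">Blu-ray</format>
            </movie>
        </decade>
    </genre>
</collection>"""
root = ET.fromstring(xml_text)

# The path ends at <format>, so the matches are format elements.
formats = root.findall("./genre/decade/movie/format[@multiple='Yes']")

# '..' steps up to the parent <movie> of each matching format.
movies = root.findall("./genre/decade/movie/format[@multiple='Yes']/..")

print([elem.tag for elem in formats], [elem.attrib for elem in formats])
print([elem.tag for elem in movies], [elem.get('title') for elem in movies])
```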
Modifying an XML
Earlier, the movie titles were an absolute mess. Now, print them out again:
for movie in root.iter('movie'):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'Back 2 the Future'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
{'favorite': 'FALSE', 'title': 'American Psycho'}
{'favorite': 'False', 'title': 'Batman: The Movie'}
{'favorite': 'True', 'title': 'Easy A'}
{'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
{'favorite': 'False', 'title': 'Ghostbusters'}
{'favorite': 'True', 'title': 'Robin Hood: Prince of Thieves'}
Fix the '2' in Back 2 the Future. That should be a find and replace problem. Write code to find the title 'Back 2 the Future' and save it as a variable:
b2tf = root.find("./genre/decade/movie[@title='Back 2 the Future']")
print(b2tf)

<Element 'movie' at 0x10ce00ef8>
Notice that using the .find() method returns an element of the tree. Much of the time, it is more useful to edit the content within an element.
Modify the title attribute of the Back 2 the Future element variable to read "Back to the Future". Then, print out the attributes of your variable to see your change. You can easily do this by accessing the attribute of an element and then assigning a new value to it:
b2tf.attrib["title"] = "Back to the Future"
print(b2tf.attrib)

{'favorite': 'False', 'title': 'Back to the Future'}
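One detail worth knowing about .find(): it returns None when nothing matches, so a typo in the path would make the attribute assignment fail. The sketch below is one defensive variant, run against a hypothetical one-movie document; it checks the result first and uses .set() and .get(), which are equivalent to writing and reading through .attrib.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-movie document for illustration.
xml_text = """<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="False" title="Back 2 the Future"/>
        </decade>
    </genre>
</collection>"""
root = ET.fromstring(xml_text)

movie = root.find("./genre/decade/movie[@title='Back 2 the Future']")
if movie is not None:
    # set() writes the attribute, like movie.attrib["title"] = ...
    movie.set('title', 'Back to the Future')

# A path that matches nothing yields None rather than raising an error.
missing = root.find("./genre/decade/movie[@title='No Such Title']")

print(movie.get('title'))
print(missing)
```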
Write out your changes back to the XML so they are permanently fixed in the document. Print out your movie attributes again to make sure your changes worked. Use the .write() method to do this:
tree.write("movies.xml")

tree = ET.parse('movies.xml')
root = tree.getroot()

for movie in root.iter('movie'):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'Back to the Future'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
{'favorite': 'FALSE', 'title': 'American Psycho'}
{'favorite': 'False', 'title': 'Batman: The Movie'}
{'favorite': 'True', 'title': 'Easy A'}
{'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
{'favorite': 'False', 'title': 'Ghostbusters'}
{'favorite': 'True', 'title': 'Robin Hood: Prince of Thieves'}
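If you would rather not overwrite movies.xml while experimenting, the same write-then-reload check can be done on a scratch copy. This is a sketch under that assumption: the file name movies_copy.xml and the one-movie document are hypothetical, and the file lives in a temporary directory that is deleted afterwards.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical one-movie document wrapped in an ElementTree so it can be written.
root = ET.fromstring('<collection><movie title="Back 2 the Future"/></collection>')
tree = ET.ElementTree(root)
root.find("movie").set("title", "Back to the Future")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "movies_copy.xml")  # hypothetical file name
    # Write with an explicit encoding and XML declaration.
    tree.write(path, encoding="utf-8", xml_declaration=True)
    # Parse the file again to confirm the change was saved.
    reloaded_root = ET.parse(path).getroot()
    saved_title = reloaded_root.find("movie").get("title")

print(saved_title)
```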
Fixing Attributes
The multiple attribute is wrong in some places. Use ElementTree to fix the designator based on how many formats the movie comes in. First, print the format attribute and text to see which parts need to be fixed.
for form in root.findall("./genre/decade/movie/format"):
    print(form.attrib, form.text)

{'multiple': 'No'} DVD
{'multiple': 'Yes'} DVD,Online
{'multiple': 'False'} Blu-ray
{'multiple': 'Yes'} dvd, digital
{'multiple': 'No'} VHS
{'multiple': 'No'} Online
{'multiple': 'Yes'} DVD
{'multiple': 'No'} DVD
{'multiple': 'No'} blue-ray
{'multiple': 'Yes'} DVD,VHS
{'multiple': 'No'} DVD
{'multiple': 'Yes'} DVD,digital,Netflix
{'multiple': 'No'} Online,VHS
{'multiple': 'No'} Blu_Ray
There is some work that needs to be done on this tag.
You can use regex to find commas, which will tell you whether the multiple attribute should be "Yes" or "No". Adding and modifying attributes can be done easily with the .set() method.
import re

for form in root.findall("./genre/decade/movie/format"):
    # Search for the commas in the format text
    match = re.search(',', form.text)
    if match:
        form.set('multiple', 'Yes')
    else:
        form.set('multiple', 'No')

# Write out the tree to the file again
tree.write("movies.xml")

tree = ET.parse('movies.xml')
root = tree.getroot()

for form in root.findall("./genre/decade/movie/format"):
    print(form.attrib, form.text)

{'multiple': 'No'} DVD
{'multiple': 'Yes'} DVD,Online
{'multiple': 'No'} Blu-ray
{'multiple': 'Yes'} dvd, digital
{'multiple': 'No'} VHS
{'multiple': 'No'} Online
{'multiple': 'No'} DVD
{'multiple': 'No'} DVD
{'multiple': 'No'} blue-ray
{'multiple': 'Yes'} DVD,VHS
{'multiple': 'No'} DVD
{'multiple': 'Yes'} DVD,digital,Netflix
{'multiple': 'Yes'} Online,VHS
{'multiple': 'No'} Blu_Ray
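Because the test is only for a literal comma, the same fix can be sketched without regex by using Python's in operator. This is an alternative to the loop above, not a replacement for it, and it runs on a hypothetical three-movie document. It also guards against a format element with no text, where .text would be None.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-movie document with inconsistent 'multiple' values.
xml_text = """<collection>
    <movie title="A"><format multiple="Yes">DVD</format></movie>
    <movie title="B"><format multiple="False">DVD,Online</format></movie>
    <movie title="C"><format multiple="No">Online,VHS</format></movie>
</collection>"""
root = ET.fromstring(xml_text)

for fmt in root.iter('format'):
    text = fmt.text or ''  # an empty element has .text == None
    # A comma in the text means more than one format is listed.
    fmt.set('multiple', 'Yes' if ',' in text else 'No')

fixed = [(fmt.text, fmt.get('multiple')) for fmt in root.iter('format')]
print(fixed)
```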
Moving Elements
Some of the data has been placed in the wrong decade. Use what you have learned about XML and ElementTree to find and fix the decade data errors.
It will be useful to print out both the decade tags and the year tags throughout the document.
for decade in root.findall("./genre/decade"):
    print(decade.attrib)
    for year in decade.findall("./movie/year"):
        print(year.text)

{'years': '1980s'}
1981
1984
1985
{'years': '1990s'}
2000
1992
1992
{'years': '1970s'}
1979
{'years': '1980s'}
1986
2000
{'years': '1960s'}
1966
{'years': '2010s'}
2010
2011
{'years': '1980s'}
1984
{'years': '1990s'}
1991
The two years that are in the wrong decade are the movies from the 2000s. Figure out what those movies are, using an XPath expression.
for movie in root.findall("./genre/decade/movie[year='2000']"):
    print(movie.attrib)

{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'False', 'title': 'American Psycho'}
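Spotting the wrong decade by eye works for a short file. As a sketch of how the same check could be automated, the code below derives the expected decade label from each year and compares it with the years attribute of the enclosing decade. It assumes every year is a four-digit number and every label has the form '1990s'; the three-movie document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-movie document; two movies sit in the wrong decade.
xml_text = """<collection>
    <genre category="Action">
        <decade years="1990s">
            <movie title="X-Men"><year>2000</year></movie>
            <movie title="Batman Returns"><year>1992</year></movie>
        </decade>
    </genre>
    <genre category="Thriller">
        <decade years="1980s">
            <movie title="American Psycho"><year>2000</year></movie>
        </decade>
    </genre>
</collection>"""
root = ET.fromstring(xml_text)

misplaced = []
for decade in root.iter('decade'):
    label = decade.get('years')
    for movie in decade.findall('movie'):
        year = int(movie.findtext('year'))
        expected = f"{year // 10 * 10}s"  # e.g. 1992 -> '1990s'
        if expected != label:
            misplaced.append((movie.get('title'), label, expected))

print(misplaced)
```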
You have to add a new decade tag, the 2000s, to the Action genre in order to move the X-Men data. The .SubElement() method can be used to add this tag to the end of the XML.
action = root.find("./genre[@category='Action']")
new_dec = ET.SubElement(action, 'decade')
new_dec.attrib["years"] = '2000s'
Now append the X-Men movie to the 2000s and remove it from the 1990s, using .append() and .remove(), respectively.
xmen = root.find("./genre/decade/movie[@title='X-Men']")
dec2000s = root.find("./genre[@category='Action']/decade[@years='2000s']")
dec2000s.append(xmen)
dec1990s = root.find("./genre[@category='Action']/decade[@years='1990s']")
dec1990s.remove(xmen)
Build XML Documents
Nice, so you were able to essentially move an entire movie to a new decade. Save your changes back to the XML.
tree.write("movies.xml")

tree = ET.parse('movies.xml')
root = tree.getroot()

print(ET.tostring(root, encoding='utf8').decode('utf8'))
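The whole move can be rehearsed on a small tree before touching the real file. This sketch uses a hypothetical two-movie Action genre: it creates the 2000s decade, appends X-Men to it, removes X-Men from the 1990s, and then lists the titles under each decade. Passing the attribute dictionary to ET.SubElement() is a shorthand for setting years afterwards.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-movie document; X-Men (2000) is filed under the 1990s.
xml_text = """<collection>
    <genre category="Action">
        <decade years="1990s">
            <movie title="X-Men"><year>2000</year></movie>
            <movie title="Batman Returns"><year>1992</year></movie>
        </decade>
    </genre>
</collection>"""
root = ET.fromstring(xml_text)

action = root.find("./genre[@category='Action']")
dec1990s = action.find("./decade[@years='1990s']")
xmen = dec1990s.find("./movie[@title='X-Men']")

# Create the new decade as the last child of the Action genre.
dec2000s = ET.SubElement(action, 'decade', {'years': '2000s'})

# Attach the movie to its new parent, then detach it from the old one.
dec2000s.append(xmen)
dec1990s.remove(xmen)

layout = {
    decade.get('years'): [movie.get('title') for movie in decade]
    for decade in action
}
print(layout)
```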
Conclusion
ElementTree is an important Python library that allows you to parse and navigate an XML document. Using ElementTree breaks down the XML document in a tree structure that is easy to work with. When in doubt, print it out: print(ET.tostring(root, encoding='utf8').decode('utf8')) is a helpful print statement for viewing the entire XML document at once.
References
- Original Post as published by Steph Howson: Datacamp
- Python 3 Documentation: ElementTree
- Wikipedia: XML
Source: https://towardsdatascience.com/processing-xml-in-python-elementtree-c8992941efd2