Extracting specific tags and values from XML is not something most of us do on a daily basis. In my case, I had to figure out how to extract all DevCoops blog post titles using the
sitemap.xml file. Here's how I did it.
Step 1. First, I need only the <loc> tags, since this is where the URLs are stored. To get them, run:
curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d'
^<loc>: matches any line that starts with <loc>.
!d: deletes every line that does not match, so only the matching lines are kept.
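To see what the anchored pattern does without hitting the network, here is a minimal sketch that runs the same filter against a small local sample. The sample XML and its example.com URLs are made up for illustration, and the second, whitespace-tolerant pattern is my own variation, not part of the original command. It is there because `^<loc>` silently skips any `<loc>` line that is indented.

```shell
# Hypothetical sample standing in for the real sitemap (URLs are made up).
sample='<urlset>
<url>
<loc>https://example.com/post-one/</loc>
<lastmod>2022-01-01</lastmod>
</url>
<url>
  <loc>https://example.com/post-two/</loc>
</url>
</urlset>'

# Pattern from Step 1: keep only lines that begin with <loc>.
strict=$(printf '%s\n' "$sample" | sed '/^<loc>/!d')

# Variation (an assumption, not from the original): also accept
# leading spaces or tabs before <loc>.
tolerant=$(printf '%s\n' "$sample" | sed '/^[[:space:]]*<loc>/!d')

echo "strict:"
printf '%s\n' "$strict"
echo "tolerant:"
printf '%s\n' "$tolerant"
```

The strict filter returns only the first `<loc>` line, while the tolerant one returns both. If the real sitemap is indented, the second form is likely the safer choice.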
Step 2. Next, remove the <loc> tags themselves so that only the URLs remain:
curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d' | sed -e 's/<[^>]*>//g'
-e (optional): --expression=script. Lets you pass one or more commands (scripts) without invoking more than one instance of sed.
s/<[^>]*>//g: removes every tag occurrence, leaving only the text between the tags.
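Since -e accepts several scripts, both steps can probably be folded into a single sed call instead of two piped ones. The sketch below shows that idea on a made-up local sample (the example.com URLs are placeholders, not real DevCoops posts), so it can be run offline.

```shell
# Hypothetical sample standing in for the real sitemap (URLs are made up).
sample='<url>
<loc>https://example.com/post-one/</loc>
<lastmod>2022-01-01</lastmod>
</url>
<url>
<loc>https://example.com/post-two/</loc>
</url>'

# One sed process, two scripts:
#   1. delete every line that does not start with <loc>
#   2. strip all tags from the lines that remain
urls=$(printf '%s\n' "$sample" | sed -e '/^<loc>/!d' -e 's/<[^>]*>//g')

printf '%s\n' "$urls"

# Against the live sitemap, the same pair of scripts would be used as:
#   curl -s https://devcoops.com/sitemap.xml | sed -e '/^<loc>/!d' -e 's/<[^>]*>//g'
```

Lines deleted by the first script never reach the second one, so the substitution only runs on the `<loc>` lines.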
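Parsing XML with regular expressions is strongly discouraged, since it is hard to get right in practice. The commands above depend on each <loc> element sitting on its own line at the very start of that line. For anything more complex, a dedicated XML parser is the better tool.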