
Extract XML tags and values from the CLI using sed

Jan 09, 2023 · 1 min read

Extracting tags and values from XML is not something we do on a daily basis. In my case, I had to figure out how to extract all DevCoops blog post titles using the sitemap.xml file. Here’s how I did it.

Prerequisites

  • sed

Solution

Step 1. First, I need the <loc> tag only. This is where the URLs are stored. To get them, run:

curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d'
  • ^<loc>: matches lines that start with <loc>.
  • !d: the ! negates the match, so d deletes every line that does not start with <loc>. Only the <loc> lines remain.
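
To try the filter without hitting the network, here is a minimal sketch that runs it against a made-up sitemap fragment. The example.com URLs and the sample and loc_lines variable names are assumptions for illustration, not the actual DevCoops sitemap.

```shell
# Hypothetical sitemap fragment; the real file is the one fetched with curl above.
sample='<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/first-post/</loc>
<lastmod>2023-01-09T00:00:00+00:00</lastmod>
</url>
<url>
<loc>https://example.com/second-post/</loc>
</url>
</urlset>'

# Keep only the lines that start with <loc>; every other line is deleted.
loc_lines=$(printf '%s\n' "$sample" | sed '/^<loc>/!d')
printf '%s\n' "$loc_lines"
```

Note that the ^ anchor means an indented <loc> line would not match. If the sitemap indents its elements, dropping the anchor (/<loc>/!d) is a looser alternative.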

Step 2. Next, remove the <loc> tags.

curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d' | sed -e 's/<[^>]*>//g'
  • -e (optional): --expression=script. Adds a script (command) to run. Repeating -e lets a single sed instance run several commands in order.
  • s/<[^>]*>//g: removes every tag on the line, that is, anything between < and >.
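
Since -e can be repeated, both steps can likely be folded into one sed process. The following is a sketch on a made-up fragment (the example.com URL and the sample and urls variable names are assumptions), not the command used in the post:

```shell
# Hypothetical sitemap fragment used only for illustration.
sample='<url>
<loc>https://example.com/first-post/</loc>
<lastmod>2023-01-09T00:00:00+00:00</lastmod>
</url>'

# One sed process, two expressions: drop the non-<loc> lines, then strip the tags.
urls=$(printf '%s\n' "$sample" | sed -e '/^<loc>/!d' -e 's/<[^>]*>//g')
printf '%s\n' "$urls"
```

The order matters: the first expression deletes the unwanted lines before the second one gets a chance to strip their tags.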

Conclusion

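Parsing XML with regular expressions is strongly discouraged. The commands above only work when every <loc> element starts its own line with no leading whitespace, and they break as soon as the file is formatted differently. A dedicated XML parser is the more reliable tool.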
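
As a small illustration of how fragile the line-based approach is, the sketch below feeds the same pipeline a made-up sitemap that has been squeezed onto a single line. The example.com URLs and the minified and result variable names are assumptions.

```shell
# Hypothetical minified sitemap: valid XML, but everything is on one line.
minified='<urlset><url><loc>https://example.com/first-post/</loc></url><url><loc>https://example.com/second-post/</loc></url></urlset>'

# The only line starts with <urlset>, not <loc>, so the filter deletes it
# and nothing is left for the second sed to clean up.
result=$(printf '%s\n' "$minified" | sed '/^<loc>/!d' | sed -e 's/<[^>]*>//g')
printf 'extracted: "%s"\n' "$result"
```

The XML is still well-formed, yet the pipeline extracts nothing, because sed works on lines rather than on elements.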
