
Simple CLI command to extract URLs from the sitemap.xml file

Jan 05, 2023 · 2 mins read

I wanted to automate some analytics, so I first needed to pull a list of all blog post links (URLs) from the sitemap.xml file. Here’s what I came up with.

Prerequisites

  • curl
  • sed

Solution

Let’s take the DevCoops sitemap.xml file as an example.

Step 1. First, GET the sitemap.xml file using curl.

curl https://devcoops.com/sitemap.xml

Example:

...
<url>
<loc>https://devcoops.com/install-bc-linux-macos-windows/</loc>
<lastmod>2022-03-19T00:00:00+01:00</lastmod>
</url>
<url>
<loc>https://devcoops.com/repair-and-optimize-mysql-databases/</loc>
<lastmod>2022-03-20T00:00:00+01:00</lastmod>
</url>
...

Step 2. Delete every line that doesn’t start with the <loc> tag, using the sed stream editor.

curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d' 

Example:

...
<loc>https://devcoops.com/install-bc-linux-macos-windows/</loc>
<loc>https://devcoops.com/repair-and-optimize-mysql-databases/</loc>
...
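The filter in Step 2 only keeps lines where <loc> starts in the very first column. The DevCoops sitemap is laid out that way, but sitemaps that indent their elements would produce no output at all. A minimal sketch of a whitespace-tolerant variant, assuming each <loc> element still sits on its own line (keep_loc_lines is a hypothetical helper name, and the example.com URL is a placeholder):

```shell
# keep_loc_lines: hypothetical helper, reads XML on stdin.
# Prints only the lines whose first non-blank text is <loc>.
# -n suppresses sed's default output; p prints the matching lines.
keep_loc_lines() {
  sed -n '/^[[:space:]]*<loc>/p'
}

# Usage with an indented sample (placeholder data):
printf '%s\n' \
  '<url>' \
  '  <loc>https://example.com/a/</loc>' \
  '  <lastmod>2022-03-19T00:00:00+01:00</lastmod>' \
  '</url>' | keep_loc_lines
```

The leading whitespace is kept in the output here; it is the tag-stripping step that follows which would need to remove it as well if it matters.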

Step 3. Get rid of the tags.

curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d' | sed -e 's/<[^>]*>//g' 

Example:

...
https://devcoops.com/install-bc-linux-macos-windows/
https://devcoops.com/repair-and-optimize-mysql-databases/
...

Step 4. Save the output in a file. For instance:

curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d' | sed -e 's/<[^>]*>//g' > sitemap_results.txt
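The two sed calls can probably be folded into one, since sed lets a substitution be restricted to lines matching an address. The sketch below is meant to behave the same as Steps 2 and 3 combined; extract_urls is a hypothetical function name, and it reads the XML from stdin so it can be tried without a network connection:

```shell
# extract_urls: hypothetical helper, reads sitemap XML on stdin.
# For lines starting with <loc>, strip every tag and print the result;
# -n keeps all other lines from being printed.
extract_urls() {
  sed -n '/^<loc>/s/<[^>]*>//gp'
}

# Usage with sample data (placeholder URL):
printf '%s\n' \
  '<url>' \
  '<loc>https://example.com/a/</loc>' \
  '<lastmod>2022-03-19T00:00:00+01:00</lastmod>' \
  '</url>' | extract_urls
```

Against the real sitemap, this would be used as curl -s https://devcoops.com/sitemap.xml | extract_urls > sitemap_results.txt, where -s silences curl's progress meter.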

Bonus tip(s):

  1. The sitemap.xml file also lists the “extra” Jekyll pages that aren’t posts: /categories/, /tags/, /contact/ and /privacy-policy/. Since they are the last 4 lines of the output, drop them with a negative line count (GNU head on Linux; ghead on macOS, see tip 2):
    curl https://devcoops.com/sitemap.xml | sed '/^<loc>/!d' | sed -e 's/<[^>]*>//g' | ghead -n -4 > sitemap_results.txt
    
  2. On macOS, use ghead instead of head, as the built-in head doesn’t support negative line counts. ghead ships with GNU coreutils; to install it, run:
    brew install coreutils
    

Conclusion

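Parsing XML with regular expressions is strongly discouraged because it’s fragile in practice: the commands above only work as long as every <loc> element sits on its own line and starts in the first column. A proper XML parser is the more robust tool for the job.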

Feel free to leave a comment below and if you find this tutorial useful, follow our official channel on Telegram.