Hosting a dynamic NextJS sitemap in S3 and scheduling automatic updates

For websites such as news and blogging platforms, where new content is added every day, manually updating the sitemap for each new page becomes cumbersome. This is how I automated the process for a serverless NextJS website hosted on AWS that I manage.

Create an S3 bucket with the initial version of the sitemap

Create an S3 bucket on AWS that stores only the sitemap(s), and either allow public read access to the bucket or, if your NextJS server is not serverless, grant read access to its source IP. This is a sample AWS S3 bucket policy that allows only public read access to the bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::<bucket name>/*"
        }
    ]
}

This is a sample sitemap listing some links within a website that we want search engines to crawl.

<?xml version="1.0" encoding="UTF-8"?>
   <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
      <url>
         <loc>https://teachsomebody.com/</loc>
      </url>
      <url>
         <loc>https://teachsomebody.com/course/home</loc>
      </url>
      <url>
         <loc>https://teachsomebody.com/blog/home</loc>
      </url>
      <url>
         <loc>https://teachsomebody.com/course/view/programming-thinking/JiM4dIfE-nCmdcx43tNmx</loc>
      </url>
</urlset>

Let us store this file with the name ‘sitemap_s3.xml’ in our AWS S3 bucket. 
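Assuming the AWS CLI is installed and configured with write access to the bucket, the upload can be done with a single command (the bucket name is a placeholder):

```shell
# Upload the initial sitemap to the S3 bucket (replace <bucket name>)
aws s3 cp sitemap_s3.xml s3://<bucket name>/sitemap_s3.xml
```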

Create a NextJS page that reads the sitemap from S3

In our NextJS app, let us create a page called sitemap.xml.js, which reads the ‘sitemap_s3.xml’ file from S3 on each request.

import React from 'react'

class Sitemap extends React.Component {
    static async getInitialProps({ res }) {
        // Fetch the sitemap stored in S3
        const request = await fetch('https://<S3 bucket URL>/sitemap_s3.xml');
        const links = await request.text();

        // Return the XML content directly as the response
        res.setHeader('Content-Type', 'text/xml');
        res.write(links);
        res.end();
    }
}

export default Sitemap

The code above reads the contents of the sitemap_s3.xml from AWS S3 and sends the contents as a response to the request from the search engine crawler.
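To verify the route locally, assuming the dev server runs on the default port 3000, you can request the page and confirm that XML comes back with the right content type:

```shell
# Request the sitemap route from a local dev server (port 3000 assumed)
curl -i http://localhost:3000/sitemap.xml
```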

Automatically updating the sitemap

Consistently updating the sitemap becomes complicated if you have a website or platform where different users can create and publish new pages at any time. The complication arises when two processes write to the sitemap simultaneously, one overwriting the other's changes and leaving the file with missing or incomplete updates.

To solve this problem, we save the newly published page URLs (to be appended to the sitemap) in a database (AWS DynamoDB) and allow only a scheduled Lambda function to update the sitemap at specific time intervals.
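For illustration, assuming a table whose items carry an `id` partition key and a `url` attribute (the attribute names the scheduled function reads), the publishing code could build an entry like this; the helper name and the extra timestamp field are my own:

```python
import time
import uuid

def make_sitemap_entry(url):
    """Build a DynamoDB item for a newly published page.

    'id' and 'url' are the attributes the scheduled function expects;
    'created_at' is an optional extra for bookkeeping.
    """
    return {
        'id': str(uuid.uuid4()),
        'url': url,
        'created_at': int(time.time()),
    }

# At publish time, the item would be written with boto3, e.g.:
# sitemap_table.put_item(Item=make_sitemap_entry(new_page_url))
```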

The Lambda function goes through these steps to update the sitemap:

  1. Scan the DynamoDB table for new entries
  2. If there are new entries, fetch and parse the current sitemap_s3.xml
  3. Append the URLs in the table to the sitemap
  4. Delete the table entries that were appended to the sitemap
  5. Save the new sitemap_s3.xml to S3
import boto3
from botocore.exceptions import ClientError
from bs4 import BeautifulSoup

# Set up the boto3 clients (table and bucket names are placeholders)
dynamodb = boto3.resource('dynamodb')
sitemap_table = dynamodb.Table('<DynamoDB table name>')
s3_client = boto3.client('s3')
bucket_name = '<S3 bucket name>'

def lambda_handler(event, context):
    try:
        # 1. Scan the DynamoDB table for new entries
        urls_resp = sitemap_table.scan()

        if len(urls_resp['Items']) > 0:
            # 2. Retrieve the current sitemap from S3
            sitemap_name = 'sitemap_s3.xml'
            tmp_file = '/tmp/' + sitemap_name
            s3_client.download_file(bucket_name, sitemap_name, tmp_file)

            with open(tmp_file, 'r') as f_to_read:
                soup = BeautifulSoup(f_to_read.read(), 'lxml')

            for item in urls_resp['Items']:
                # 3. Append the URL if it is not already in the sitemap
                url_to_append = item['url']
                locs = soup.find_all('loc', string=url_to_append)

                if len(locs) == 0:
                    new_url = soup.new_tag('url')
                    new_loc = soup.new_tag('loc')
                    new_loc.string = url_to_append
                    new_url.insert(0, new_loc)
                    soup.urlset.append(new_url)

                # 4. Delete the URL entry from the database
                sitemap_table.delete_item(Key={'id': item['id']})

            # The lxml parser wraps the document in <html><body> tags; strip them
            updated_sitemap = str(soup).replace("<html><body>", "").replace("</body></html>", "")
            with open(tmp_file, 'w') as f_to_write:
                f_to_write.write(updated_sitemap)

            # 5. Upload the updated sitemap back to S3
            s3_client.upload_file(tmp_file, bucket_name, sitemap_name)

            # You may ping the search engine to indicate that your sitemap has been updated.
    except ClientError as e:
        print(str(e))
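As a standalone sketch of the dedup-and-append step (steps 2–4), the same logic can be written with only the standard library's ElementTree. This is my own illustration, not the code the Lambda runs:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9'

def append_urls(sitemap_xml, new_urls):
    """Return the sitemap XML with any not-yet-present URLs appended."""
    # Serialize the sitemap namespace without a prefix
    ET.register_namespace('', SITEMAP_NS)
    root = ET.fromstring(sitemap_xml)
    # Collect URLs already present so duplicates are skipped
    existing = {loc.text for loc in root.iter('{%s}loc' % SITEMAP_NS)}
    for url in new_urls:
        if url not in existing:
            url_el = ET.SubElement(root, '{%s}url' % SITEMAP_NS)
            loc_el = ET.SubElement(url_el, '{%s}loc' % SITEMAP_NS)
            loc_el.text = url
    return ET.tostring(root, encoding='unicode')
```

Keeping this logic in a pure function makes it easy to unit-test without touching S3 or DynamoDB.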

Note that you need to install the imported third-party libraries (e.g. bs4 and lxml) in your Lambda deployment package and set up the right credentials and permissions for the boto3 DynamoDB and S3 clients.
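One way to run the function on a schedule, assuming the AWS CLI is configured (the rule name and ARN below are placeholders), is an EventBridge rule:

```shell
# Trigger the sitemap-update Lambda every hour (rule name and ARN are placeholders)
aws events put-rule --name sitemap-update-schedule --schedule-expression "rate(1 hour)"
aws events put-targets --rule sitemap-update-schedule \
    --targets "Id"="1","Arn"="<Lambda function ARN>"
```

You would also need to grant EventBridge permission to invoke the function (via `aws lambda add-permission`).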

 

Published By
Evans Boateng Owusu
Evans is a Computer Engineer and cloud technology enthusiast. He has a Master's degree in Embedded Systems (focusing on software design) from the Technical University of Eindhoven (The Netherlands) and a Bachelor of Science in Electronic and Computer Engineering from the Polytechnic University of Turin (Italy). In addition, he has worked in the high-tech industry in the Netherlands and other large corporations for over seven years.