For websites such as news and blog sites, where new content is added every day, manually updating the sitemap for each new page quickly becomes cumbersome. Here is how I automated the process for a serverless NextJS website hosted on AWS that I manage.
Create an S3 bucket on AWS that stores only the sitemap(s), and either allow public read access to the bucket or, if your NextJS server is not serverless, grant read access to its source IP. Here is a sample AWS S3 bucket policy that allows public read access only:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::<bucket name>/*"
        }
    ]
}
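If you prefer to attach the policy programmatically, a minimal sketch with boto3 could look like this; the bucket name is a placeholder, and the credentials you run it with must allow s3:PutBucketPolicy on the bucket:

import json
import boto3

s3_client = boto3.client('s3')

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject", "s3:GetObjectVersion"],
            "Resource": "arn:aws:s3:::<bucket name>/*"
        }
    ]
}

# Attach the public-read policy to the bucket.
s3_client.put_bucket_policy(Bucket='<bucket name>', Policy=json.dumps(bucket_policy))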
This is a sample sitemap listing some URLs within a website that we want search engines to crawl.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url>
        <loc>https://teachsomebody.com/</loc>
    </url>
    <url>
        <loc>https://teachsomebody.com/course/home</loc>
    </url>
    <url>
        <loc>https://teachsomebody.com/blog/home</loc>
    </url>
    <url>
        <loc>https://teachsomebody.com/course/view/programming-thinking/JiM4dIfE-nCmdcx43tNmx</loc>
    </url>
</urlset>
Let us store this file with the name ‘sitemap_s3.xml’ in our AWS S3 bucket.
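If you want to script the upload too, a small boto3 sketch (again with a placeholder bucket name) might be:

import boto3

s3_client = boto3.client('s3')

# Upload the sitemap and set the content type so that it is served as XML.
s3_client.upload_file(
    'sitemap_s3.xml',
    '<bucket name>',
    'sitemap_s3.xml',
    ExtraArgs={'ContentType': 'text/xml'}
)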
In our NextJS app, let us create a page at pages/sitemap.xml.js which reads the ‘sitemap_s3.xml’ file from S3 on request.
import React from 'react'

class Sitemap extends React.Component {
  static async getInitialProps({ res }) {
    // Fetch the sitemap stored in S3. On older Node versions you may
    // need a server-side fetch polyfill such as isomorphic-unfetch.
    const request = await fetch('https://<S3 bucket URL>/sitemap_s3.xml');
    const links = await request.text();

    // Send the XML straight back as the response body.
    res.setHeader('Content-Type', 'text/xml');
    res.write(links);
    res.end();
  }
}

export default Sitemap
The code above reads the contents of sitemap_s3.xml from AWS S3 and returns them as the response to the search engine crawler's request.
Keeping the sitemap up to date becomes more complicated if you run a website or platform where different users can create and publish new pages at any time. Concurrent updates can race with one another: one process overwrites the lines being written by another at the same time, leading to missing or incomplete entries.
To solve this problem, we save the newly published page URLs (to be appended to the sitemap) in a database (AWS DynamoDB) and allow only a scheduled Lambda function to update the sitemap at fixed intervals.
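For illustration, a publish handler could queue each new URL roughly like this. The table name sitemap-urls is an assumption, but the id and url attributes match what the Lambda function below expects:

import uuid
import boto3

dynamodb = boto3.resource('dynamodb')
# Hypothetical table name; 'id' is assumed to be the partition key.
sitemap_table = dynamodb.Table('sitemap-urls')

def enqueue_sitemap_url(url):
    # Store the newly published URL until the scheduled Lambda
    # function folds it into the sitemap.
    sitemap_table.put_item(Item={'id': str(uuid.uuid4()), 'url': url})

# Example call with an illustrative URL.
enqueue_sitemap_url('https://teachsomebody.com/blog/view/my-new-post')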
The Lambda function goes through the following steps to update the sitemap:
import boto3
from botocore.exceptions import ClientError
from bs4 import BeautifulSoup

# Clients for DynamoDB and S3; credentials come from the Lambda execution role.
dynamodb = boto3.resource('dynamodb')
sitemap_table = dynamodb.Table('<table name>')
s3_client = boto3.client('s3')

def lambda_handler(event, context):
    try:
        urls_resp = sitemap_table.scan()
        if len(urls_resp['Items']) > 0:
            # Retrieve sitemap from S3
            sitemap_name = 'sitemap_s3.xml'
            tmp_file = '/tmp/' + sitemap_name
            s3_client.download_file('<S3 bucket name>', sitemap_name, tmp_file)
            with open(tmp_file, 'r') as f_to_read:
                soup = BeautifulSoup(f_to_read.read(), 'lxml')
            for item in urls_resp['Items']:
                # Skip URLs that are already in the sitemap
                url_to_append = item['url']
                locs = soup.find_all('loc', string=url_to_append)
                if len(locs) == 0:
                    new_url = soup.new_tag('url')
                    new_loc = soup.new_tag('loc')
                    new_loc.string = url_to_append
                    new_url.insert(0, new_loc)
                    soup.urlset.append(new_url)
                # Delete the URL entry from the database
                sitemap_table.delete_item(
                    Key={
                        'id': item['id']
                    }
                )
            # The lxml HTML parser wraps the document in <html><body>;
            # strip those wrappers before writing the sitemap back.
            updated_sitemap = str(soup).replace("<html><body>", "").replace("</body></html>", "")
            with open(tmp_file, 'w') as f_to_write:
                f_to_write.write(updated_sitemap)
            s3_client.upload_file(tmp_file, '<S3 bucket name>', sitemap_name)
            # You may ping the search engine to indicate that your
            # sitemap has been updated (see below).
    except ClientError as e:
        print(str(e))
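As a sketch of that last step: Google has historically supported a simple ping endpoint that takes the sitemap URL as a query parameter (check the search engine's current documentation, as such endpoints change). Using only the standard library:

import urllib.parse
import urllib.request

def ping_google(sitemap_url):
    # Historically https://www.google.com/ping?sitemap=<url>;
    # verify the endpoint is still supported before relying on it.
    query = urllib.parse.urlencode({'sitemap': sitemap_url})
    with urllib.request.urlopen('https://www.google.com/ping?' + query) as resp:
        print('Ping returned HTTP', resp.status)

ping_google('https://teachsomebody.com/sitemap.xml')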
Note that you need to install the imported libraries (boto3, beautifulsoup4 and lxml) in your virtual environment and set up the right credentials for the boto3 DynamoDB and S3 clients.
Written by:
Evans is a Computer Engineer and cloud technology enthusiast. He has a Master's degree in Embedded Systems (focusing on software design) from the Technical University of Eindhoven (The Netherlands) and a Bachelor of Science in Electronic and Computer Engineering from the Polytechnic University of Turin (Italy). In addition, he has worked in the high-tech industry in the Netherlands and at other large corporations for over seven years.