XML Site Maps
Sitemaps are dynamically generated from a publisher’s content. Two types of sitemap are generated:
- Collection Index Sitemap: located at the root domain, this sitemap index points to the sitemap of each collection owned by a publisher.
- Document Index Sitemap: located at the collection level, this sitemap contains an entry for each document published in that collection.
Once the bot gets to the document, all content within that document is indexed.
Collection Index Sitemap Example:
Root Domain: https://www.yourdomain.com
Collection 1: https://www.yourdomain.com/collection1
Collection 2: https://www.yourdomain.com/collection2
Collection Index Sitemap location: https://www.yourdomain.com/sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex version="2.0" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.yourdomain.com/collection1/sitemap.xml</loc>
    <lastmod>2019-12-10T18:17:26.000Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.yourdomain.com/collection2/sitemap.xml</loc>
    <lastmod>2019-12-10T18:17:26.000Z</lastmod>
  </sitemap>
</sitemapindex>
Document Index Sitemap Example:
For each collection, a sitemap is dynamically generated with a link to every document within that collection. For example, if Collection 1 has 5 published documents, its sitemap looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset version="2.0" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.yourdomain.com/collection1/document1</loc>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>https://www.yourdomain.com/collection1/document2</loc>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>https://www.yourdomain.com/collection1/document3</loc>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>https://www.yourdomain.com/collection1/document4</loc>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>https://www.yourdomain.com/collection1/document5</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
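To illustrate how the two levels fit together, here is a minimal Python sketch of how a crawler might walk from the Collection Index Sitemap down to individual document URLs. It is not how any particular search engine bot is implemented. The helper names (locs, crawl) and the in-memory PAGES dictionary are assumptions made for this example; a real crawler would fetch the sitemaps over HTTP.

```python
import xml.etree.ElementTree as ET

# Namespace declared by both sitemap examples above.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def locs(xml_text, parent_tag):
    """Return the trimmed <loc> value of every <parent_tag> entry."""
    root = ET.fromstring(xml_text.strip().encode("utf-8"))
    urls = []
    for entry in root.findall(f"sm:{parent_tag}", NS):
        loc = entry.find("sm:loc", NS)
        if loc is not None and loc.text:
            urls.append(loc.text.strip())
    return urls


def crawl(index_url, fetch):
    """Follow a sitemap index down to document URLs.

    `fetch` is a caller-supplied function that maps a URL to sitemap XML text.
    """
    documents = []
    for collection_sitemap in locs(fetch(index_url), "sitemap"):
        documents.extend(locs(fetch(collection_sitemap), "url"))
    return documents


# Hypothetical stand-in for HTTP responses, trimmed to one collection.
PAGES = {
    "https://www.yourdomain.com/sitemap.xml": """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.yourdomain.com/collection1/sitemap.xml</loc>
  </sitemap>
</sitemapindex>""",
    "https://www.yourdomain.com/collection1/sitemap.xml": """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.yourdomain.com/collection1/document1</loc></url>
  <url><loc>https://www.yourdomain.com/collection1/document2</loc></url>
</urlset>""",
}

documents = crawl("https://www.yourdomain.com/sitemap.xml", PAGES.__getitem__)
print(documents)
```

The sketch reads only the loc element of each entry; lastmod and changefreq are hints that a crawler may or may not use.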
Duplicating content on your sites
We know that publishers will often want to distribute an article from their digital edition to other platforms, including their own website.
Google’s definition of duplicate content is as follows:
That last part is important. If you scrape, copy, and spin existing content (Google calls this copied content) with the intention of deceiving the search engine to get a higher ranking, you will be on dangerous ground.
Google says this type of malicious intent might trigger an action:
In our experience, Google crawlers have historically viewed articles in the web reader in the context of the issue they belong to. The article is unique enough so as not to be seen as malicious, even if it is duplicated on other websites.
Although duplicate content is not technically penalized, it can still affect search engine rankings, because search results will display only one version of the article. This can dilute the visibility of both versions and have an adverse effect on inbound link equity, since other sites will link back to only one of the articles.
To avoid duplicate-content issues, best practice is to use canonical URLs. A canonical URL points crawlers to the original article so they know which version should get SEO credit. For our publishers, the article usually originates in the digital edition, so they set the web reader article as the canonical URL on the reproduced article on their website.
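As a rough illustration, a canonical URL is normally declared with a link element in the head of the reproduced page. The Python sketch below builds that element. The helper name canonical_link_tag is hypothetical, and the URL is the example document address used earlier, assumed here to be the web reader location of the article. How the tag gets into your website's templates depends on your own CMS.

```python
from html import escape


def canonical_link_tag(original_url):
    """Build the <link rel="canonical"> element for the <head> of a reproduced article."""
    # Escape the URL so characters such as & are valid inside the attribute.
    return f'<link rel="canonical" href="{escape(original_url, quote=True)}">'


tag = canonical_link_tag("https://www.yourdomain.com/collection1/document1")
print(tag)
# prints: <link rel="canonical" href="https://www.yourdomain.com/collection1/document1">
```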
In some cases, publishers want the reproduced article to be considered the original. However, our web reader is currently engineered in a way that prevents article-specific metadata from being added manually. Alternative methods, such as 301 redirects, nofollow directives, and robots.txt entries, are also not available at this time.