How to Find All Current and Archived URLs on a Website

There are many reasons you might need to find all the URLs on a website, but your exact goal will determine what you're searching for. For example, you may want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.

In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your website's size.

Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.

Archive.org
Archive.org is a valuable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
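If you'd rather not rely on a scraping plugin, the Wayback Machine's public CDX API can return the same URL list programmatically. Below is a minimal Python sketch; the domain is a placeholder, and it's worth confirming the parameters against the current CDX documentation:

```python
import requests

# Query the Wayback Machine CDX API for URLs captured under a domain.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",  # placeholder domain; replace with yours
        "output": "json",
        "fl": "original",        # return only the original URL field
        "collapse": "urlkey",    # collapse repeated captures of the same URL
        "limit": 10000,
    },
    timeout=60,
)
resp.raise_for_status()

# With output=json the API returns a list of rows; the first row is a header.
rows = resp.json()
urls = [row[0] for row in rows[1:]]
print(len(urls), "URLs retrieved")
```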

Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're managing a large website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most websites apply the same robots.txt rules to Moz's bots as they do to Google's, this approach generally works well as a proxy for Googlebot's discoverability.
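If you go the API route, a sketch like the one below can pull link data programmatically. This assumes the Moz Links API v2 with HTTP basic auth; the credentials and domain are placeholders, and the response field names are assumptions to verify against Moz's current documentation:

```python
import requests

ACCESS_ID = "your-access-id"    # placeholder credentials
SECRET_KEY = "your-secret-key"

# Minimal sketch against the Moz Links API v2 "links" endpoint.
resp = requests.post(
    "https://lsapi.seomoz.com/v2/links",
    auth=(ACCESS_ID, SECRET_KEY),
    json={
        "target": "example.com",        # placeholder domain
        "target_scope": "root_domain",  # links pointing anywhere on the domain
        "limit": 50,                    # paginate for larger exports
    },
    timeout=60,
)
resp.raise_for_status()
data = resp.json()

# Collect the linked-to pages on your site (assumed response structure;
# check the actual JSON shape returned by your API version).
target_urls = {link["target"]["page"] for link in data.get("results", [])}
print(sorted(target_urls))
```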

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Much like Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might have to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
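For reference, here's a minimal Python sketch against the Search Analytics endpoint of the Search Console API. It assumes a service account JSON file with access to the property; the file path, property URL, and date range are placeholders:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Authenticate with a service account that has access to the GSC property.
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",  # placeholder path
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

# Query pages with impressions; rowLimit maxes out at 25,000 per request,
# so use startRow to paginate through larger properties.
response = service.searchanalytics().query(
    siteUrl="https://example.com/",  # placeholder property
    body={
        "startDate": "2024-01-01",
        "endDate": "2024-03-31",
        "dimensions": ["page"],
        "rowLimit": 25000,
    },
).execute()

pages = [row["keys"][0] for row in response.get("rows", [])]
print(len(pages), "pages with impressions")
```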

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively bypassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs present in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they offer useful insights.
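If you'd rather script this than click through segments, the GA4 Data API can run the same filtered report. The sketch below assumes the google-analytics-data Python client with application-default credentials; the property ID, date range, and /blog/ pattern are placeholders:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service account
# with access to the GA4 property.
client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-03-31")],
    # Equivalent of the "/blog/" segment in the UI steps above.
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                value="/blog/",
                match_type=Filter.StringFilter.MatchType.CONTAINS,
            ),
        )
    ),
    limit=100000,
)

response = client.run_report(request)
paths = [row.dimension_values[0].value for row in response.rows]
print(len(paths), "blog paths")
```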

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive record of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be massive, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process, and even a short script can do a first pass (see the sketch below).
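As a starting point, here's a rough Python sketch that extracts unique request paths from a combined-format access log. The filename and regex are assumptions you'll likely need to adapt to your server or CDN's log format:

```python
import re

# Match the request line inside a combined/common-format access log entry,
# e.g. "GET /blog/post-1 HTTP/1.1", and capture the requested path.
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as f:  # placeholder filename
    for line in f:
        match = REQUEST_RE.search(line)
        if match:
            paths.add(match.group(1))

print(len(paths), "unique paths")
```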
Merge, and good luck
Once you've collected URLs from all these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
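For the Jupyter Notebook route, a minimal pandas sketch might look like this; the CSV filenames and the "url" column are placeholders, and the normalization rules are just examples to tailor to your site:

```python
import pandas as pd

# Load one CSV per source (placeholder filenames), each with a "url" column.
sources = ["archive.csv", "moz.csv", "gsc.csv", "ga4.csv", "logs.csv"]
urls = pd.concat([pd.read_csv(f) for f in sources], ignore_index=True)["url"].dropna()

# Normalize consistently: trim whitespace, drop fragments and trailing
# slashes so near-duplicates collapse together.
urls = (
    urls.str.strip()
        .str.replace(r"#.*$", "", regex=True)
        .str.replace(r"/+$", "", regex=True)
)

urls = urls.drop_duplicates().sort_values()
urls.to_csv("all_urls.csv", index=False)
print(len(urls), "unique URLs")
```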

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
