In short, wget makes a pretty effective broken link finder. That is great news for anyone with many years of blogging behind them or who runs a CMS such as WordPress: it lets you track down and update all the dead links buried in old blog entries.
We install wget:
sudo apt-get install wget
For large websites, it is also worth installing screen, so the session keeps running and we can come back to it after our spider has done its work:
sudo apt-get install screen
We start a screen session:
screen
We create a file to launch our spider:
touch ./spider
chmod 755 ./spider
nano ./spider
And we put the following command in the file:
wget --spider -o ./not-found.log -e robots=off -w 1 -r -E -H -p -nH -nd http://www.example.com
Where:
--spider - runs wget in spider mode, so nothing is actually downloaded
-o - writes the output to the given log file instead of to the screen (the default)
-e robots=off - ignores directives from robots.txt
-w 1 - waits 1 second between requests so as not to overload the server
-r - recursive: every link found is followed
-E - adjusts file names so HTML pages get the correct .html extension
-H - spans hosts, i.e. also follows links that lead to external sites
-p - also requests page requisites such as images, because our task is to collect information about any link that might cause an error on our website
-nH - does not create directories for external hosts
-nd - does not create local directories at all
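For readability, the ./spider file can also be written as a small script; this is just the command above split across lines, with comments added, and the address is still only a placeholder for your own site:
#!/bin/sh
# Crawl the site in spider mode and log everything to not-found.log.
# Replace www.example.com with your own address.
wget --spider \
     -o ./not-found.log \
     -e robots=off \
     -w 1 \
     -r -E -H -p -nH -nd \
     http://www.example.com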
We launch our spider:
sh ./spider
And we detach from the screen session:
[ctrl]-[a] + [d]
We periodically check the progress:
screen -r
tail -n 50 ./not-found.log
[ctrl]-[a] + [d]
Once finished, it’s best to check for errors using grep:
grep -B 2 '404' ./not-found.log
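To turn that check into a de-duplicated list of the URLs that answered with a 404, a small pipeline along the lines below can help. It assumes the usual English-language wget log layout, in which a request line starting with "--" (timestamp followed by the URL) appears a few lines above the matching "404 Not Found" response; adjust the -B window or the patterns if your wget version or locale logs differently:
grep -B 5 '404 Not Found' ./not-found.log | awk '/^--/ && NF >= 3 {print $3}' | sort -u
The awk step keeps only the URL field from the request lines (and skips the bare "--" separators that grep inserts between matches), while sort -u removes duplicates.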