Your own simple spider using WGET

In short, it’s a pretty effective broken link finder. That’s great news for anyone with many years of blogging behind them or a site running a CMS such as WordPress: with the resulting log, you can track down and update all the dead links that have accumulated in old blog entries.

We install wget:

sudo apt-get install wget
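
The package name is the same under most other package managers (for example, sudo dnf install wget on Fedora). You can confirm the installation worked with:

wget --version | head -n 1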

For large websites, it is also worth installing screen, so you can detach from the terminal session and come back to it later while the spider keeps working:

sudo apt-get install screen

We start screen:

screen
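
If you tend to run several sessions at once, you can optionally give this one a name, which makes it easier to find later (“spider” here is just an illustrative name):

screen -S spider

A named session can then be resumed with screen -r spider instead of a plain screen -r.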

We create a file to launch our spider:

touch ./spider
chmod 755 ./spider
nano ./spider

And we put the following in the file:

wget --spider -o ./not-found.log -e robots=off -w 1 -r -E -H -p -nH -nd http://www.example.com

Where:

--spider – runs wget in spider mode – nothing is downloaded, links are only checked
-o – writes the log to the given file instead of to the screen as by default
-e robots=off – ignores the directives in robots.txt
-w 1 – waits 1 second between requests so as not to hammer the server
-r – recursive – every link found is followed and checked in turn
-E – would save HTML files with the correct .html extension (harmless here, since spider mode saves nothing)
-H – spans hosts – links leading to external domains are followed too, so external links also get checked
-p – also requests page requisites such as images – because our task is to collect information about any link that may cause an error on our website
-nH – does not create a separate directory per host
-nd – does not create any local directory hierarchy

A sanity check before the full crawl: it can be worth testing a single page with the same flags minus the recursion, using -nv to keep the output short (the URL below is just a placeholder):
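
wget --spider -nv http://www.example.com/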

We launch our spider:

sh ./spider

And we detach from the screen session:

[ctrl]-[a] + [d]

We periodically check the progress:

screen -r 
tail -n 50 ./not-found.log 
[ctrl]-[a] + [d]
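
For a rough progress indicator, you can also count how many URLs have been requested so far – this assumes the default wget log format, where each request starts with a timestamped line beginning with --:

grep -c '^--' ./not-found.log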

Once finished, it’s best to check for errors using grep:

grep -B 2 '404' ./not-found.log
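
And if you want just the offending URLs as a deduplicated list, one way to distill them (again assuming the default log format, where the requested URL appears a couple of lines above the HTTP status) is:

grep -B 2 ' 404 ' ./not-found.log | grep -o 'http[^ ]*' | sort -u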