This code example uses a BASH script to show how to get all PDF, DOC, or XLS links from a URL and download the files to a specified directory.
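
The script depends on curl (to fetch the page) and wget (to download the files). If you want a quick check that both are installed before running it, a one-liner like the following works; it's just a convenience and not part of the script itself:

command -v curl >/dev/null 2>&1 && command -v wget >/dev/null 2>&1 || echo "curl and wget are required";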


Explanation of the Script:


1) Create a new BASH script file

$: vi scraper.sh
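
The file will also need to be executable before it can be run directly at the end; one way is:

$: chmod +x scraper.sh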

2) Declare the file as a BASH script and clear the screen

#!/bin/bash
clear;

3) Collect the info for the script

# This is the URL to the webpage that has the files to scrape
read -e -p "URL to scrape: " url;

# The location where wget will save the files
read -e -p "Path to store files (/path/to/files/dir_name or ./dir_name): " path;

# To only download from a specific domain, we need to add a filter
read -e -p "Only scrape from what domain? (i.e. www.dangibson.me - no http/https or '/'): " domain;

4) The curl GET command

# Finds absolute links - if there are relative (or local) links, you'll need to modify this line (one way to handle them is sketched just below)
c=$(curl -X GET "${url}" --silent | grep -Eo "(http|https)://[a-zA-Z0-9./?=_%:-]*");
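
If the page does use relative links, here is a rough sketch of one way to handle them. It is not part of the original script: it introduces two helper variables (base and rel), assumes simple root-relative href values (href="/something"), and appends the rebuilt URLs to the same c variable so the rest of the script works unchanged:

# Grab root-relative hrefs and rebuild them as absolute URLs on the same site
base=$(echo "${url}" | grep -Eo "^https?://[^/]+");
rel=$(curl --silent "${url}" | grep -Eo 'href="/[a-zA-Z0-9./?=_%:-]*"' | sed -e 's/^href="//' -e 's/"$//');
for r in $rel; do
    c="$c $base$r";
done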

5) Create an array with all of the links curl found

# adds each link as an array element
entity_arr=($(echo $c | tr ";" "\n"))
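
As written, the array is really built by ordinary word splitting; the tr call only matters if the links ever come back separated by semicolons. On bash 4 or newer, an arguably sturdier alternative is mapfile, which reads the newline-separated curl output straight into the array:

mapfile -t entity_arr <<< "$c"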

6) Loop over the array, print each matching link, and save the files to the specified location

# For each array element, we print the link for the user and save the file to the specified location using WGET
for i in "${entity_arr[@]}"
do
    if [[ $i == *"$domain"* ]]; then
        if [[ $i == *".pdf"* ]] || [[ $i == *".doc"* ]] || [[ $i == *".xls"* ]]; then
            echo "$i"
            wget "$i" -P "$path" >/dev/null 2>&1
        fi
    fi
done
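
The substring tests above already catch .docx and .xlsx (they contain .doc and .xls), but they will also match a URL that merely contains .pdf somewhere in the middle. A slightly stricter alternative for the inner test, sketched here with a hypothetical ext_re variable rather than the original logic, is a single regex that requires the extension to end the path or sit just before a query string:

# Stricter match: extension must end the URL or be followed by a query string
ext_re='\.(pdf|docx?|xlsx?)($|\?)';
if [[ $i =~ $ext_re ]]; then
    echo "$i"
    wget "$i" -P "$path" >/dev/null 2>&1
fi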

7) Finally, let the user know the script has completed

echo 'Done'
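
With the file saved and marked executable (step 1), run the script directly and it will prompt for the URL, the download path, and the domain filter in turn:

$: ./scraper.sh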

Here’s the Entire Script:


#!/bin/bash
clear;

# This is the URL to the webpage that has the files to scrape
read -e -p "URL to scrape: " url;

# The location where wget will save the files
read -e -p "Path to store files (/path/to/files/dir_name or ./dir_name): " path;

# To only download from a specific domain, we need to add a filter
read -e -p "Only scrape from what domain? (i.e. www.dangibson.me - no http/https or '/'): " domain;

# Finds absolute links - if there are relative (or local) links, you'll need to modify this line
c=$(curl -X GET "${url}" --silent | grep -Eo "(http|https)://[a-zA-Z0-9./?=_%:-]*");

# adds each link as an array element
entity_arr=($(echo $c | tr ";" "\n"))

# For each array element, we print the link for the user and save the file to the specified location using WGET
for i in "${entity_arr[@]}"
do
    if [[ $i == *"$domain"* ]]; then
        if [[ $i == *".pdf"* ]] || [[ $i == *".doc"* ]] || [[ $i == *".xls"* ]]; then
            echo "$i"
            wget "$i" -P "$path" >/dev/null 2>&1
        fi
    fi
done

echo 'Done'