# Complete wget Guide: Master File Downloads & Web Scraping

Master wget, the powerful command-line tool for downloading files, mirroring websites, and automating web retrieval tasks with this comprehensive guide.


## Table of Contents

1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Basic Syntax and Usage](#basic-syntax-and-usage)
4. [Common Options and Parameters](#common-options-and-parameters)
5. [Download Scenarios](#download-scenarios)
6. [Advanced Features](#advanced-features)
7. [Configuration Files](#configuration-files)
8. [Error Handling and Troubleshooting](#error-handling-and-troubleshooting)
9. [Best Practices](#best-practices)
10. [Examples and Use Cases](#examples-and-use-cases)

## Introduction

wget is a free, non-interactive command-line utility for downloading files from the web. It supports HTTP, HTTPS, and FTP protocols and provides extensive functionality for retrieving files, entire websites, and handling various download scenarios. The name "wget" comes from "World Wide Web" and "get," reflecting its primary purpose of retrieving content from the internet.

wget is particularly powerful because it can work in the background, handle network interruptions gracefully, and resume interrupted downloads. It's an essential tool for system administrators, developers, and power users who need reliable file downloading capabilities.

### Key Features

- Non-interactive operation suitable for scripts and automation
- Recursive downloading capabilities
- Resume interrupted downloads
- Bandwidth throttling
- Proxy support
- SSL/TLS support
- Cookie handling
- User agent customization
- Mirror entire websites
- Background operation

## Installation

### Linux Systems

Most Linux distributions include wget by default. If not installed, use the following commands:

Ubuntu/Debian:

```bash
sudo apt update
sudo apt install wget
```

CentOS/RHEL/Fedora:

```bash
# CentOS/RHEL
sudo yum install wget
# or for newer versions
sudo dnf install wget

# Fedora
sudo dnf install wget
```

Arch Linux:

```bash
sudo pacman -S wget
```

### macOS

Using Homebrew:

```bash
brew install wget
```

Using MacPorts:

```bash
sudo port install wget
```

### Windows

Download wget for Windows from the GNU wget website or use package managers like Chocolatey:

```powershell
choco install wget
```

### Verification

Verify the installation by checking the version:

```bash
wget --version
```

## Basic Syntax and Usage

### General Syntax

```bash
wget [options] [URL]
```

### Simple Download

The most basic usage is downloading a single file:

```bash
wget https://example.com/file.zip
```

This command downloads the file to the current directory with its original filename.

### Specifying Output Filename

Use the -O option to specify a different filename:

```bash
wget -O newname.zip https://example.com/file.zip
```

### Downloading to a Specific Directory

Use the -P option to specify the download directory:

```bash
wget -P /path/to/directory https://example.com/file.zip
```

## Common Options and Parameters

### Output and Logging Options

| Option | Description | Example |
|--------|-------------|---------|
| -O filename | Save document to filename | wget -O document.pdf https://example.com/doc.pdf |
| -P directory | Save files to directory | wget -P ~/Downloads https://example.com/file.zip |
| -o logfile | Log messages to logfile | wget -o download.log https://example.com/file.zip |
| -a logfile | Append messages to logfile | wget -a download.log https://example.com/file.zip |
| -q | Quiet mode (no output) | wget -q https://example.com/file.zip |
| -v | Verbose mode | wget -v https://example.com/file.zip |
| --progress=type | Progress indicator type | wget --progress=bar https://example.com/file.zip |

### Download Control Options

| Option | Description | Example |
|--------|-------------|---------|
| -c | Continue partial downloads | wget -c https://example.com/largefile.zip |
| -t number | Number of retries | wget -t 5 https://example.com/file.zip |
| -T seconds | Timeout in seconds | wget -T 30 https://example.com/file.zip |
| --limit-rate=rate | Limit download rate | wget --limit-rate=200k https://example.com/file.zip |
| -w seconds | Wait between retrievals | wget -w 2 https://example.com/file.zip |
| --random-wait | Random wait between downloads | wget --random-wait https://example.com/file.zip |

### HTTP Options

| Option | Description | Example |
|--------|-------------|---------|
| --user-agent=agent | Set user agent string | wget --user-agent="Mozilla/5.0" https://example.com/file.zip |
| --referer=url | Set referer URL | wget --referer="https://example.com" https://example.com/file.zip |
| --header=header | Add custom header | wget --header="Accept: application/json" https://api.example.com/data |
| --post-data=string | Send POST data | wget --post-data="param=value" https://example.com/api |
| --no-cookies | Disable cookies (enabled by default) | wget --no-cookies https://example.com/file.zip |
| --load-cookies=file | Load cookies from file | wget --load-cookies=cookies.txt https://example.com/file.zip |
| --save-cookies=file | Save cookies to file | wget --save-cookies=cookies.txt https://example.com/file.zip |

### Authentication Options

| Option | Description | Example |
|--------|-------------|---------|
| --http-user=user | HTTP username | wget --http-user=john https://example.com/protected/file.zip |
| --http-password=pass | HTTP password | wget --http-password=secret https://example.com/protected/file.zip |
| --ask-password | Prompt for password | wget --http-user=john --ask-password https://example.com/protected/file.zip |
| --certificate=file | Client certificate file | wget --certificate=client.crt https://secure.example.com/file.zip |
| --private-key=file | Private key file | wget --private-key=client.key https://secure.example.com/file.zip |

## Download Scenarios

### Single File Download

Basic file download with progress indication:

```bash
wget https://releases.ubuntu.com/20.04/ubuntu-20.04.3-desktop-amd64.iso
```

### Multiple Files Download

Download multiple files by specifying multiple URLs:

```bash
wget https://example.com/file1.zip https://example.com/file2.zip https://example.com/file3.zip
```

### Download from File List

Create a file containing URLs and use the -i option:

```bash
# Create urls.txt file with URLs
echo "https://example.com/file1.zip" > urls.txt
echo "https://example.com/file2.zip" >> urls.txt
echo "https://example.com/file3.zip" >> urls.txt

# Download all files from the list
wget -i urls.txt
```

### Resume Interrupted Downloads

Use the -c option to continue interrupted downloads:

```bash
wget -c https://example.com/largefile.zip
```

### Background Downloads

Use the -b option for background downloads:

```bash
wget -b https://example.com/largefile.zip
```

Check progress with:

```bash
tail -f wget-log
```
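For scripted monitoring, the latest progress percentage can be pulled out of the background log with standard tools. A minimal sketch, assuming the usual wget-log progress format (the printf line below simulates one log line in place of a real download; exact log layout can vary between wget versions):

```bash
# Simulate one line of a wget progress log (stand-in for a real wget-log).
printf 'file.zip   42%%[======>        ]   1.2M   200KB/s\n' > wget-log

# Print the most recent percentage seen in the log.
grep -o '[0-9]\+%' wget-log | tail -n 1   # → 42%
```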

### Bandwidth Limiting

Limit download speed to avoid consuming all available bandwidth:

```bash
# Limit to 200 KB/s
wget --limit-rate=200k https://example.com/largefile.zip

# Limit to 1 MB/s
wget --limit-rate=1m https://example.com/largefile.zip
```

## Advanced Features

### Recursive Downloads

wget can recursively download entire websites or directory structures:

#### Basic Recursive Download

```bash
wget -r https://example.com/
```

#### Website Mirroring

Create a complete mirror of a website:

```bash
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.com/
```

Options explanation:

- --mirror: Enable mirroring options
- --convert-links: Convert links for local viewing
- --adjust-extension: Add appropriate extensions to files
- --page-requisites: Download CSS, images, etc.
- --no-parent: Don't ascend to the parent directory

#### Recursive Download Options

| Option | Description | Example |
|--------|-------------|---------|
| -r | Recursive download | wget -r https://example.com/ |
| -l depth | Maximum recursion depth | wget -r -l 3 https://example.com/ |
| -A pattern | Accept only matching files | wget -r -A "*.pdf" https://example.com/ |
| -R pattern | Reject matching files | wget -r -R ".gif,.jpg" https://example.com/ |
| --include-directories=list | Include only specified directories | wget -r --include-directories=docs https://example.com/ |
| --exclude-directories=list | Exclude specified directories | wget -r --exclude-directories=temp https://example.com/ |

### FTP Downloads

wget supports FTP protocol for downloading files:

#### Anonymous FTP

```bash
wget ftp://ftp.example.com/pub/file.zip
```

#### FTP with Authentication

```bash
wget --ftp-user=username --ftp-password=password ftp://ftp.example.com/private/file.zip
```

#### Recursive FTP Download

```bash
wget -r ftp://ftp.example.com/pub/directory/
```

### Proxy Support

wget can work through various proxy servers:

#### HTTP Proxy

wget has no dedicated proxy flag; instead, pass the wgetrc proxy settings on the command line with -e:

```bash
wget -e use_proxy=yes -e http_proxy=proxy.example.com:8080 https://example.com/file.zip
```

#### Environment Variables

Set proxy through environment variables:

```bash
export http_proxy=http://proxy.example.com:8080
export https_proxy=http://proxy.example.com:8080
export ftp_proxy=http://proxy.example.com:8080
wget https://example.com/file.zip
```

### SSL/TLS Options

| Option | Description | Example |
|--------|-------------|---------|
| --secure-protocol=protocol | Choose SSL/TLS protocol | wget --secure-protocol=TLSv1_2 https://example.com/file.zip |
| --no-check-certificate | Don't verify SSL certificates | wget --no-check-certificate https://self-signed.example.com/file.zip |
| --ca-certificate=file | Use specific CA certificate | wget --ca-certificate=ca.crt https://example.com/file.zip |
| --ca-directory=directory | CA certificates directory | wget --ca-directory=/etc/ssl/certs https://example.com/file.zip |

## Configuration Files

### Global Configuration

wget reads configuration from several locations:

1. System-wide configuration: /etc/wgetrc
2. User configuration: ~/.wgetrc
3. Environment variable: $WGETRC

If $WGETRC is set, the file it points to is read instead of ~/.wgetrc, and user settings override the system-wide file.

### Configuration File Format

Configuration files use simple key-value pairs:

```
# ~/.wgetrc example

# Set default timeout
timeout = 30

# Set default number of retries
tries = 3

# Set default user agent
user_agent = Mozilla/5.0 (compatible; wget)

# Enable cookies by default
cookies = on

# Set default download directory
dir_prefix = /home/user/Downloads

# Limit download rate
limit_rate = 500k

# Enable continue by default
continue = on

# Be more verbose
verbose = on
```

### Common Configuration Options

| Option | Description | Example |
|--------|-------------|---------|
| timeout | Default timeout | timeout = 30 |
| tries | Default retry count | tries = 5 |
| user_agent | Default user agent | user_agent = MyBot/1.0 |
| limit_rate | Default rate limit | limit_rate = 200k |
| dir_prefix | Default download directory | dir_prefix = /downloads |
| continue | Enable resume by default | continue = on |
| recursive | Enable recursive by default | recursive = off |

## Error Handling and Troubleshooting

### Common Exit Codes

| Exit Code | Description |
|-----------|-------------|
| 0 | No problems occurred |
| 1 | Generic error code |
| 2 | Parse error |
| 3 | File I/O error |
| 4 | Network failure |
| 5 | SSL verification failure |
| 6 | Username/password authentication failure |
| 7 | Protocol errors |
| 8 | Server issued an error response |
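In scripts, these codes can be turned into readable messages with a small helper; a sketch (the function name is illustrative, and the messages mirror the table above):

```bash
# Map a wget exit code to the description from the table above.
describe_wget_exit() {
    case "$1" in
        0) echo "No problems occurred" ;;
        1) echo "Generic error code" ;;
        2) echo "Parse error" ;;
        3) echo "File I/O error" ;;
        4) echo "Network failure" ;;
        5) echo "SSL verification failure" ;;
        6) echo "Username/password authentication failure" ;;
        7) echo "Protocol errors" ;;
        8) echo "Server issued an error response" ;;
        *) echo "Unknown exit code: $1" ;;
    esac
}

describe_wget_exit 4   # → Network failure
```

In a real script, call it immediately after wget, e.g. `wget -q "$URL"; describe_wget_exit "$?"`.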

### Common Issues and Solutions

#### SSL Certificate Problems

Problem: SSL certificate verification fails:

```
ERROR: cannot verify example.com's certificate
```

Solutions:

```bash
# Option 1: Skip certificate verification (not recommended for production)
wget --no-check-certificate https://example.com/file.zip

# Option 2: Update CA certificates
sudo apt update && sudo apt install ca-certificates

# Option 3: Specify CA certificate
wget --ca-certificate=/path/to/ca.crt https://example.com/file.zip
```

#### Connection Timeouts

Problem: Downloads timeout frequently

Solutions:

```bash
# Increase timeout and retries
wget -T 60 -t 10 https://example.com/file.zip

# Add wait between retries
wget -T 60 -t 10 -w 5 https://example.com/file.zip
```

#### Rate Limiting

Problem: Server blocks requests due to high frequency

Solutions:

```bash
# Add random wait between requests
wget --random-wait --wait=1 -r https://example.com/

# Limit download rate
wget --limit-rate=100k https://example.com/file.zip
```

#### 403 Forbidden Errors

Problem: Server returns 403 Forbidden

Solutions:

```bash
# Set proper user agent
wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" https://example.com/file.zip

# Set referer
wget --referer="https://example.com" https://example.com/file.zip

# Add custom headers
wget --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" https://example.com/file.zip
```

### Debugging Options

| Option | Description | Example |
|--------|-------------|---------|
| --debug | Enable debug output | wget --debug https://example.com/file.zip |
| -v | Verbose output | wget -v https://example.com/file.zip |
| --server-response | Show server headers | wget --server-response https://example.com/file.zip |
| --spider | Check if file exists without downloading | wget --spider https://example.com/file.zip |
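Because --spider reports reachability through the exit code, a small link checker needs only a few lines. A sketch under assumptions (the function name and the urls.txt list are illustrative, not part of wget itself):

```bash
# Report whether a URL is reachable, without downloading it.
check_url() {
    if wget --spider --quiet "$1"; then
        echo "OK       $1"
    else
        echo "MISSING  $1"
    fi
}

# Usage: check every URL in a list, one per line:
#   while read -r url; do check_url "$url"; done < urls.txt
```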

## Best Practices

### Security Considerations

1. Verify SSL certificates: Avoid using --no-check-certificate in production
2. Use HTTPS when available: Prefer secure connections
3. Protect credentials: Use configuration files with proper permissions for authentication
4. Validate downloads: Check file integrity using checksums when available
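Point 4 can be made concrete with sha256sum. A minimal sketch, assuming the server publishes a .sha256 file alongside the download (the sample file below stands in for a real download so the steps are reproducible offline):

```bash
# Stand-in for a downloaded file, e.g. from: wget -q https://example.com/file.zip
printf 'example payload\n' > file.zip

# A real checksum file would itself be downloaded, e.g.:
#   wget -q https://example.com/file.zip.sha256
sha256sum file.zip > file.zip.sha256

# Verify before trusting the file; sha256sum -c exits non-zero on mismatch.
if sha256sum -c --quiet file.zip.sha256; then
    echo "Checksum OK"
else
    echo "Checksum mismatch" >&2
    rm -f file.zip
fi
```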

### Performance Optimization

1. Use appropriate retry settings: Balance between reliability and speed
2. Implement rate limiting: Respect server resources and avoid being blocked
3. Resume interrupted downloads: Use -c for large files
4. Optimize recursive downloads: Use appropriate depth limits and filters
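The retry, rate-limit, and resume advice above can be folded into one reusable helper; a sketch under assumed tuning values (the function name and the specific numbers are illustrative, not recommendations for every server):

```bash
# Reusable download helper: resume support, bounded retries, polite rate limit.
download() {
    local url=$1
    local dest=${2:-.}   # destination directory, defaults to the current one
    wget \
        --continue \
        --tries=5 \
        --timeout=60 \
        --waitretry=5 \
        --limit-rate=500k \
        --directory-prefix="$dest" \
        "$url"
}

# Usage: download https://example.com/largefile.zip ~/Downloads
```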

### Automation Best Practices

1. Use absolute paths: Specify full paths in scripts
2. Implement error handling: Check exit codes in scripts
3. Log activities: Use logging options for debugging and monitoring
4. Use configuration files: Centralize common settings

### Example Script

```bash
#!/bin/bash

# Download script with error handling

URL="https://example.com/largefile.zip"
OUTPUT_DIR="/downloads"
LOG_FILE="/var/log/wget.log"

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Download with error handling
wget \
    --continue \
    --timeout=60 \
    --tries=5 \
    --limit-rate=500k \
    --directory-prefix="$OUTPUT_DIR" \
    --output-file="$LOG_FILE" \
    --progress=bar \
    "$URL"

# Capture the exit code once; $? is reset by every subsequent command
EXIT_CODE=$?
if [ $EXIT_CODE -eq 0 ]; then
    echo "Download completed successfully"
else
    echo "Download failed with exit code $EXIT_CODE"
    exit 1
fi
```

## Examples and Use Cases

### Website Mirroring

Complete website mirror for offline browsing:

```bash
wget \
    --recursive \
    --no-clobber \
    --page-requisites \
    --html-extension \
    --convert-links \
    --restrict-file-names=windows \
    --domains example.com \
    --no-parent \
    https://example.com/
```

### Downloading Software Releases

Download the latest software release:

```bash
# Download with resume capability and progress bar
wget \
    --continue \
    --progress=bar:force \
    --timeout=30 \
    --tries=5 \
    https://github.com/user/project/releases/download/v1.0.0/software.tar.gz
```

### API Data Retrieval

Download data from REST API:

```bash
# Download JSON data with authentication
wget \
    --header="Authorization: Bearer TOKEN" \
    --header="Accept: application/json" \
    --output-document=data.json \
    https://api.example.com/data
```

### Batch File Downloads

Download multiple files with pattern matching:

```bash
# Download all PDF files from a directory listing
wget \
    --recursive \
    --no-parent \
    --accept="*.pdf" \
    --level=1 \
    https://example.com/documents/
```

### FTP Site Synchronization

Synchronize local directory with FTP site:

```bash
# Mirror FTP directory
wget \
    --mirror \
    --ftp-user=username \
    --ftp-password=password \
    --no-host-directories \
    --cut-dirs=1 \
    ftp://ftp.example.com/pub/files/
```

### Scheduled Downloads

Create a cron job for regular downloads:

```bash
# Add to crontab (crontab -e)
# Download daily at 2 AM (five cron fields: minute hour day-of-month month day-of-week)
0 2 * * * /usr/bin/wget --quiet --output-document=/backup/daily-$(date +\%Y\%m\%d).sql https://example.com/backup.sql
```

### Large File Download with Monitoring

Download large files with detailed monitoring:

```bash
#!/bin/bash

URL="https://example.com/large-dataset.zip"
OUTPUT="/data/dataset.zip"
LOG="/var/log/dataset-download.log"

# Start download in background with detailed logging
wget \
    --continue \
    --timeout=300 \
    --tries=10 \
    --waitretry=30 \
    --limit-rate=1m \
    --progress=bar:force:noscroll \
    --output-file="$LOG" \
    --output-document="$OUTPUT" \
    "$URL" &

WGET_PID=$!

# Monitor progress
while kill -0 $WGET_PID 2>/dev/null; do
    if [ -f "$OUTPUT" ]; then
        SIZE=$(du -h "$OUTPUT" | cut -f1)
        echo "Downloaded: $SIZE"
    fi
    sleep 10
done

wait $WGET_PID
EXIT_CODE=$?

if [ $EXIT_CODE -eq 0 ]; then
    echo "Download completed successfully"
    # Verify download if checksum available
    # sha256sum -c dataset.zip.sha256
else
    echo "Download failed with exit code $EXIT_CODE"
fi
```

This comprehensive guide covers the essential aspects of using wget for file downloads. The tool's flexibility and extensive options make it suitable for everything from simple file downloads to complex web scraping and site mirroring tasks. Understanding these features and best practices will help you use wget effectively in various scenarios.

## Tags

- Command Line
- Linux
- file-download
- web scraping
- wget
