Downloading Multiple Files with wget

wget is a powerful command-line utility for non-interactive retrieval of files from the web. It ships with most Linux distributions, runs on Windows through Cygwin or WSL, and can be installed on macOS with a package manager such as Homebrew or built from source. In this article, we explore techniques for downloading multiple files, optimizing performance, handling authentication, and ensuring privacy through VPNs.

1. Why Download Multiple Files

  • Bulk data retrieval for datasets, software packages, or mirror sites.
  • Website mirroring for offline browsing or backups.
  • Automation of repetitive download tasks in scripts or cron jobs.
  • Parallelization to reduce overall download time.

2. Basic Approaches

2.1 Specifying Multiple URLs on the Command Line

Simply list URLs separated by spaces:

wget http://example.com/file1.zip http://example.com/file2.zip http://example.com/file3.zip

Note: This approach is fine for a handful of files, but becomes unwieldy for dozens or hundreds of URLs.
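
If the filenames follow a simple numeric pattern, shell brace expansion can generate the list for you. A minimal sketch (the URL and range are placeholders; note that bash or zsh performs the expansion, not wget):

wget http://example.com/file{1..3}.zip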

2.2 Using an Input File

Prepare a plain text file urls.txt with one URL per line:

http://example.com/fileA.tar.gz
http://example.com/fileB.tar.gz
http://example.com/fileC.tar.gz

Then run:

wget -i urls.txt

This instructs wget to read URLs from the file. Use -c to continue interrupted downloads:

wget -c -i urls.txt
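
When the URLs themselves follow a pattern, urls.txt can be generated with a short shell loop instead of being written by hand (the range and filename pattern below are placeholders):

for i in $(seq 1 20); do echo "http://example.com/file$i.tar.gz"; done > urls.txt
wget -c -i urls.txt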

3. Advanced Techniques

3.1 Wildcards and Recursive Retrieval

wget expands URL wildcards such as *.jpg only for FTP URLs; for an HTTP (or FTP) server that lists directory contents, combining --recursive with --accept achieves the same result:

wget --recursive --no-parent --accept=jpg,png,gif http://example.com/images/

Options explained:

  • --recursive (-r): descend into directories.
  • --no-parent: avoid downloading parent directories.
  • --accept: comma-separated list of extensions to accept.
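
A fuller variant of the command above, which also flattens the retrieved directory tree and stores everything under images/, might look like this (the URL is a placeholder; -r, -np, -nd, and -A are the short forms of --recursive, --no-parent, --no-directories, and --accept):

wget -r -np -nd -A jpg,png,gif -P images/ http://example.com/images/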

3.2 Parallel Downloads with xargs or GNU Parallel

wget downloads URLs sequentially and has no built-in parallelism. To run multiple instances in parallel, pipe the URL list to xargs:

cat urls.txt | xargs -n 1 -P 4 wget -q

Explanation:

  • -n 1: one URL per wget invocation.
  • -P 4: four parallel processes.
  • -q: quiet mode (optional).

Alternatively, using GNU Parallel:

parallel -j 6 wget -q {} :::: urls.txt
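
Either approach can be combined with the flags from earlier sections, for example resuming partial downloads into a dedicated directory (a sketch; downloads/ is created if it does not exist). When parallelizing, make sure each URL maps to a distinct filename so the instances do not overwrite each other:

xargs -n 1 -P 4 wget -c -q -P downloads/ < urls.txt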

3.3 Authentication and Cookies

For HTTP Basic authentication:

wget --user=USERNAME --password=PASSWORD -i protected_urls.txt
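
Passing --password on the command line exposes it to other local users through the process list; for interactive runs, wget can prompt for the password instead:

wget --user=USERNAME --ask-password -i protected_urls.txt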

To maintain session cookies:

wget --load-cookies cookies.txt -i urls.txt
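
The cookie file has to exist before it can be loaded. A common pattern is to log in once with --post-data while saving the session cookies, then reuse them for the bulk download (a sketch; the login URL and form field names are site-specific placeholders):

wget --save-cookies cookies.txt --keep-session-cookies --post-data 'user=USERNAME&password=PASSWORD' http://example.com/login
wget --load-cookies cookies.txt -i urls.txt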

3.4 Bandwidth Throttling and Time Stamping

  • Limit speed: --limit-rate=200k enforces a 200 KB/s cap.
  • Time-based fetching: --timestamping (-N) downloads only files that are newer than the local copy, useful for incremental updates (see the combined example after this list).
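
Both flags combine naturally with an input file; a minimal sketch reusing urls.txt from section 2:

wget -N --limit-rate=200k -i urls.txt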

4. Managing Output and Directories

  • -P DIR: place all downloads into DIR.
  • -nd (no directories): avoid recreating directory hierarchy.
  • --cut-dirs=N: ignore N leading directory components from the remote path (see the combined example below).
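
Putting these together, the following sketch keeps the local tree shallow: -nH (--no-host-directories) drops the example.com/ component, --cut-dirs=2 removes the first two remote path components, and -P collects everything under downloads/ (the URL is a placeholder):

wget -r -np -nH --cut-dirs=2 -P downloads/ http://example.com/data/files/

With this invocation, a remote file at data/files/archive/report.pdf is saved as downloads/archive/report.pdf.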

5. Resilience and Error Handling

  • Retries: --tries=10 attempts each file up to 10 times (see the combined example after this list).
  • Wait between requests: --wait=5 pauses 5 seconds between retrievals; --waitretry=5 backs off by up to 5 seconds between retries of a failed download.
  • Timeout: --timeout=30 aborts if no response arrives within 30 seconds.
  • Backup names: --backup-converted keeps a .orig copy of each file before rewriting its links (only relevant with --convert-links).
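
A combined invocation, again reusing urls.txt, might look like this:

wget --tries=10 --waitretry=5 --timeout=30 -c -i urls.txt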

6. Privacy and VPN Usage

When downloading from public servers or geographically restricted resources, you may wish to route your traffic through a VPN. This helps in:

  • Obfuscating your IP address.
  • Bypassing region locks.
  • Enhancing encryption over insecure networks.

Recommended VPN services:

  • NordVPN: https://nordvpn.com
  • ExpressVPN: https://expressvpn.com
  • ProtonVPN: https://protonvpn.com

7. Best Practices and Tips

  • Validate URLs: ensure your list file has correct, reachable URLs.
  • Use checksums: verify integrity with md5sum or sha256sum (see the example after this list).
  • Log output: --output-file=log.txt captures progress and errors.
  • Automate with cron: schedule periodic downloads for updated datasets.
  • Monitor storage: be mindful of disk space when mirroring large sites.
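
For the checksum tip, if the server publishes a checksum manifest (the SHA256SUMS filename below is only a common convention, not guaranteed to exist on any given server), verification takes two commands:

wget http://example.com/SHA256SUMS
sha256sum -c SHA256SUMS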

8. Alternative Tools

While wget is versatile, specialized tools can offer optimized parallelism and checksumming:

  • aria2: multi-protocol, multi-source, parallel downloads (see the sketch after this list).
  • curl: scriptable HTTP transfers (useful for APIs).
  • rclone: cloud storage synchronization.
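
For example, aria2 accepts the same kind of URL list and fetches entries concurrently; a minimal sketch (the flag values are illustrative): -i reads the URL list, -j limits concurrent downloads, -x allows up to four connections per server, and -c resumes partial files.

aria2c -i urls.txt -j 4 -x 4 -c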

Conclusion

Mastering wget for downloading multiple files empowers you to automate data acquisition, mirror sites, and manage large sets of resources efficiently. By combining input files, recursion options, parallelization, and proper authentication, you can create robust scripts for any scenario. For enhanced privacy, route downloads through reputable VPN services like NordVPN, ExpressVPN, or ProtonVPN. Follow best practices for error handling, logging, and integrity checks to ensure a smooth, reliable workflow.
