Downloading Multiple Files with wget
wget is a powerful command-line utility for non-interactive retrieval of files over HTTP, HTTPS, and FTP. It is included in most Linux distributions, available on Windows through environments such as Cygwin or WSL, and can be compiled from source on macOS. In this article, we explore techniques for downloading multiple files, optimizing performance, handling authentication, and routing traffic through a VPN for privacy.
1. Why Download Multiple Files
- Bulk data retrieval for datasets, software packages, or mirror sites.
- Website mirroring for offline browsing or backups.
- Automation of repetitive download tasks in scripts or cron jobs.
- Parallelization to reduce overall download time.
2. Basic Approaches
2.1 Specifying Multiple URLs on the Command Line
Simply list URLs separated by spaces:
wget http://example.com/file1.zip http://example.com/file2.zip http://example.com/file3.zip
Note: This approach is fine for a handful of files, but becomes unwieldy for dozens or hundreds of URLs.
2.2 Using an Input File
Prepare a plain text file urls.txt with one URL per line:
http://example.com/fileA.tar.gz
http://example.com/fileB.tar.gz
http://example.com/fileC.tar.gz
Then run:
wget -i urls.txt
This instructs wget to read URLs from the file. Use -c to continue interrupted downloads:
wget -c -i urls.txt
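If the file names follow a predictable pattern, the URL list itself can be generated with a short shell loop. The host and path below are placeholders; substitute your own:
for i in $(seq 1 20); do
  echo "http://example.com/archive/part${i}.tar.gz"
done > urls.txt
wget -c -i urls.txt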
3. Advanced Techniques
3.1 Wildcards and Recursive Retrieval
When downloading from an HTTP or FTP server that lists directory contents, use --recursive and --accept:
wget --recursive --no-parent --accept=jpg,png,gif http://example.com/images/
Options explained:
- --recursive (-r): descend into directories.
- --no-parent: avoid downloading from parent directories.
- --accept: comma-separated list of file extensions to accept.
3.2 Parallel Downloads with xargs or GNU Parallel
wget downloads URLs sequentially within a single process. To run several instances in parallel, pipe the list through xargs:
cat urls.txt | xargs -n 1 -P 4 wget -q
Explanation:
- -n 1: one URL per wget invocation.
- -P 4: four parallel processes.
- -q: quiet mode (optional).
Alternatively, using GNU Parallel:
parallel -j 6 wget -q {} :::: urls.txt
3.3 Authentication and Cookies
For HTTP Basic authentication:
wget --user=USERNAME --password=PASSWORD -i protected_urls.txt
Keep in mind that passwords passed on the command line may be visible in the process list; --ask-password prompts interactively instead.
To maintain session cookies:
wget --load-cookies cookies.txt -i urls.txt
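Some sites require a form-based login before the protected URLs can be fetched. One common pattern is to capture the session cookies with wget itself and reuse them; the login URL and form field names below are illustrative and will differ from site to site:
# log in once and store the session cookies (hypothetical endpoint and field names)
wget --save-cookies cookies.txt --keep-session-cookies \
     --post-data='username=USERNAME&password=PASSWORD' \
     -O /dev/null https://example.com/login
# reuse the saved cookies for the bulk download
wget --load-cookies cookies.txt -i urls.txt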
3.4 Bandwidth Throttling and Time Stamping
- Limit speed: --limit-rate=200k enforces a 200 KB/s cap.
- Time-based fetching: --timestamping (-N) downloads a file only if it is newer than the local copy, useful for incremental updates (both options are combined in the example below).
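As a sketch, an incremental fetch that combines both options (the directory name is arbitrary):
wget -N --limit-rate=200k -i urls.txt -P mirror/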
4. Managing Output and Directories
- -P DIR: place all downloads into DIR.
- -nd (no directories): avoid recreating directory hierarchy.
- --cut-dirs=N: ignore N leading directory components of the remote path (see the combined example below).
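For example, to pull only the images/ tree into a local downloads/ directory without recreating the host and leading path components, something like the following should work (-nH drops the host-name directory; adjust --cut-dirs to your URL layout):
wget -r --no-parent -nH --cut-dirs=1 -P downloads/ http://example.com/images/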
5. Resilience and Error Handling
- Retries: --tries=10 sets up to 10 attempts per file.
- Retry delay: --waitretry=5 backs off up to 5 seconds between retries of a failed download; --wait=5 pauses 5 seconds between successive retrievals.
- Timeout: --timeout=30 aborts if there is no response within 30 seconds.
- Backup names: --backup-converted keeps a .orig copy of each file before --convert-links rewrites its links.
A combined invocation is shown below.
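A fault-tolerant bulk download might look like this (the target directory is arbitrary):
wget -c --tries=10 --waitretry=5 --timeout=30 -i urls.txt -P downloads/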
6. Privacy and VPN Usage
When downloading from public servers or geographically restricted resources, you may wish to route your traffic through a VPN. This helps in:
- Obfuscating your IP address.
- Bypassing region locks.
- Enhancing encryption over insecure networks.
Recommended VPN services:
| VPN Provider | Website |
|---|---|
| NordVPN | https://nordvpn.com |
| ExpressVPN | https://expressvpn.com |
| ProtonVPN | https://protonvpn.com |
7. Best Practices and Tips
- Validate URLs: ensure your list file has correct, reachable URLs.
- Use checksums: verify integrity with
md5sumorsha256sum. - Log output:
--output-file=log.txtcaptures progress and errors. - Automate with cron: schedule periodic downloads for updated datasets.
- Monitor storage: be mindful of disk space when mirroring large sites.
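As a sketch of the verification and logging steps, assuming the server publishes a SHA-256 checksum list alongside the files (the SHA256SUMS name and URL are hypothetical):
# download the files and the published checksum list, logging progress
wget -i urls.txt -P downloads/ --output-file=log.txt
wget -P downloads/ http://example.com/SHA256SUMS
# verify every downloaded file against the published checksums
(cd downloads/ && sha256sum -c SHA256SUMS)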
8. Alternative Tools
While wget is versatile, specialized tools can offer optimized parallelism and checksumming:
- aria2: multi-protocol, multi-source, parallel downloads.
- curl: scriptable HTTP transfers (useful for APIs).
- rclone: cloud storage synchronization.
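For instance, aria2 can consume the same urls.txt list and download several files concurrently; a minimal invocation, assuming aria2 is installed, is:
# 8 concurrent downloads, resuming partial files, saved under downloads/
aria2c -i urls.txt -j 8 -c -d downloads/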
Conclusion
Mastering wget for downloading multiple files empowers you to automate data acquisition, mirror sites, and manage large sets of resources efficiently. By combining input files, recursion options, parallelization, and proper authentication, you can create robust scripts for any scenario. For enhanced privacy, route downloads through reputable VPN services like NordVPN, ExpressVPN, or ProtonVPN. Follow best practices for error handling, logging, and integrity checks to ensure a smooth, reliable workflow.