Introduction
GNU parallel is a powerful shell tool for executing jobs in parallel using one or more computers. Whether you're processing large datasets, running numerous simulations, or automating repetitive tasks, GNU parallel can dramatically reduce your execution time and simplify your workflows.
What Is GNU parallel?
Developed by Ole Tange, GNU parallel is designed to replace tools like xargs or simple shell loops. It splits input data, launches multiple jobs concurrently, keeps each job's output grouped, and with --keep-order can return results in input order.
Key Features
- Automatic job balancing based on load and CPU count
- Support for sequences, files, commands outputs, and find results
- Job control: timeouts, retries, resource limits
- Grouping of stdout and stderr per job
- Remote execution via SSH and integration with clusters
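A quick sketch of the main input sources (a literal list after `:::`, standard input, and a generated sequence); this assumes `parallel` is installed:

```shell
# Arguments can come from a literal list after :::
parallel echo "item: {}" ::: alpha beta gamma

# ...from standard input, one job per line
printf 'one\ntwo\n' | parallel echo "line: {}"

# ...or from a generated sequence
seq 1 3 | parallel echo "num: {}"
```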
Installation
On Debian/Ubuntu
sudo apt-get update
sudo apt-get install parallel
On macOS (Homebrew)
brew install parallel
From Source
git clone git://git.savannah.gnu.org/parallel.git
cd parallel
./configure
make
sudo make install
Basic Usage
At its simplest, you can pipe a list of arguments into parallel:
cat urls.txt | parallel wget {}
This command will download each URL from urls.txt in parallel, spawning as many wget processes as you have CPU cores.
Specifying Job Slots
Control the number of concurrent jobs with -j or --jobs:
parallel -j 4 rsync -avz {} backup:/data/ ::: dir1 dir2 dir3 dir4 dir5
Use Cases
- Batch image conversion (convert, ffmpeg)
- Data processing pipelines (Python, R, awk)
- Simulations and modeling
- Remote backups and deployments
- Unit test parallelization in CI systems
Advanced Features
1. Replacement Strings
You can refer to columns of input with {1}, {2}, or even {.} (removes extension). Example:
parallel convert {1}.png -resize {2}% {1}_{2}.png ::: img1 img2 ::: 50 75 100
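A few more replacement strings in one sketch: {.} strips the extension, {/} keeps only the basename, and {#} is the job sequence number:

```shell
# Given the input "dir/photo.png":
parallel echo 'input={} noext={.} base={/} job={#}' ::: dir/photo.png
# input=dir/photo.png noext=dir/photo base=photo.png job=1
```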
2. Grouping Output
Prevent interleaving with --group (default) or --ungroup:
parallel --ungroup echo Processing {} ::: task1 task2 task3
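The difference is easiest to see with a job that prints several lines; with the default --group each job's lines arrive together, while --ungroup may interleave them (a sketch):

```shell
# With --group (the default), both lines from each job stay together.
parallel 'echo start {}; sleep 0.1; echo end {}' ::: a b c

# With --ungroup, output is printed as soon as it is produced,
# so lines from different jobs can interleave.
parallel --ungroup 'echo start {}; sleep 0.1; echo end {}' ::: a b c
```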
3. Job Log and Checkpointing
Keep track of job status and resume incomplete tasks:
parallel --joblog mylog.txt --resume mycommand.sh
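The job log is a tab-separated file with one header line plus one line per job, including exit value and runtime, which is what --resume reads to skip completed work. A sketch (the `task1`/`task2` inputs are illustrative):

```shell
# Run two trivial jobs and record their status in a log file.
parallel --joblog mylog.txt echo {} ::: task1 task2

# Inspect the header: Seq, Host, Starttime, JobRuntime, Send,
# Receive, Exitval, Signal, Command.
head -1 mylog.txt
```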
4. Timeout and Retries
Define maximum execution time and retries on failure:
parallel --timeout 300 --retries 3 process_file {} ::: *.dat
5. Remote Execution over SSH
Distribute jobs across multiple servers over SSH. Make sure your SSH connections are properly secured (key-based authentication, trusted networks) before dispatching jobs to remote machines.
parallel --sshlogin server1,server2 -j 0 python analysis.py {} ::: sample1 sample2 sample3
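For more than a couple of hosts, --sshloginfile is more convenient than listing logins on the command line. Each line is an SSH login, optionally prefixed with a CPU count and a slash; a line containing only `:` means "also run jobs on the local machine". The hostnames below are placeholders:

```
8/user@server1.example.com
4/user@server2.example.com
:
```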
Performance Considerations
To maximize throughput:
- Adjust -j to match CPU threads or I/O bandwidth.
- Use --load to avoid overwhelming the system.
- Group small tasks to reduce process-spawning overhead.
- Prefer in-memory operations when possible.
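The two main knobs can be combined; a sketch for CPU-bound compression jobs (the 150% load threshold is illustrative):

```shell
# Cap concurrency at the number of CPU threads for CPU-bound work...
parallel -j "$(nproc)" gzip -9 {} ::: *.log

# ...or run as many jobs as possible (-j 0) but back off once the
# load average exceeds 1.5 per core (--load accepts percentages).
parallel -j 0 --load 150% gzip -9 {} ::: *.log
```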
Comparing GNU parallel and xargs
| Feature | GNU parallel | xargs |
|---|---|---|
| Parallel jobs | Yes, built-in | Limited, -P flag |
| Replacement strings | Extensive ({}, {1}, {#}) | Basic (-I replace string) |
| Job logging | Yes (--joblog) | No |
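The same fan-out written both ways; both commands assume a `files.txt` listing one filename per line:

```shell
# GNU parallel: one gzip per input line, up to 4 at a time
cat files.txt | parallel -j 4 gzip -9 {}

# xargs: roughly equivalent, using -P for parallelism
# and -I for per-line substitution
cat files.txt | xargs -P 4 -I {} gzip -9 {}
```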
Best Practices
- Quote your commands to avoid shell expansions.
- Use --dry-run to verify command lines before execution.
- Implement timeouts and retries for unreliable tasks.
- Keep logs for auditing and troubleshooting.
- Test on small datasets before scaling up.
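--dry-run prints the command lines parallel would run without executing anything, which makes it easy to sanity-check quoting and replacement strings first (a sketch; the clip names are illustrative):

```shell
# Nothing is executed; the composed commands are printed instead.
parallel --dry-run ffmpeg -i {} {.}.mp4 ::: clip1.mov clip2.mov
# ffmpeg -i clip1.mov clip1.mp4
# ffmpeg -i clip2.mov clip2.mp4
```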
Real-World Examples
Parallel Video Encoding
ls *.mov | parallel -j 8 ffmpeg -i {} -c:v libx265 -crf 28 {.}.mp4
Distributed Data Aggregation
parallel --sshloginfile servers.txt -j 0 Rscript analyze.R {} ::: data_chunk*
Tips and Tricks
- Use --eta to display estimated completion times.
- Use --bar for a progress bar in interactive sessions.
- Export environment variables to jobs with --env VAR1,VAR2.
- Chain commands with shell grouping: parallel 'cd {} && make' ::: project1 project2.
Resources & Further Reading
| Resource | Link |
|---|---|
| Official Manual | parallel manual |
| Tutorial by Ole Tange | parallel tutorial |
| Stack Overflow Discussions | gnu-parallel tag |
Conclusion
GNU parallel is a versatile and efficient way to accelerate shell-based tasks. From simple file conversions to massive distributed computations, it empowers you to harness your hardware—or cluster—effectively. By following best practices, profiling your workloads, and leveraging advanced features, you can streamline complex pipelines and achieve significant time savings.