Run Commands in Parallel with GNU parallel

Introduction

GNU parallel is a powerful shell tool for executing jobs in parallel using one or more computers. Whether you're processing large datasets, running numerous simulations, or automating repetitive tasks, GNU parallel can dramatically reduce your execution time and simplify your workflows.

What Is GNU parallel?

Developed by Ole Tange, GNU parallel is designed to replace tools like xargs and hand-written shell loops. It splits input data, launches multiple jobs concurrently, and can keep the output in the same order as the input, as if the commands had been run sequentially.

Key Features

  • Automatic job balancing based on load and CPU count
  • Support for input from sequences, files, command output, and find results
  • Job control: timeouts, retries, resource limits
  • Grouping of stdout and stderr per job
  • Remote execution via SSH and integration with clusters

Installation

On Debian/Ubuntu

sudo apt-get update  
sudo apt-get install parallel

On macOS (Homebrew)

brew install parallel

From Source

git clone git://git.savannah.gnu.org/parallel.git
cd parallel
./configure && make && sudo make install
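
A quick check that the package or build installed correctly is to print the version:

parallel --version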

Basic Usage

At its simplest, you can pipe a list of arguments into parallel:

cat urls.txt | parallel wget {}

This command downloads each URL from urls.txt in parallel, by default spawning one wget job per CPU core.
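
Arguments can also be supplied inline with :::, and -k (--keep-order) forces the output into input order even when jobs finish out of order; a minimal illustration:

parallel -k echo Fetched {} ::: a b c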

Specifying Job Slots

Control the number of concurrent jobs with -j or --jobs:

parallel -j 4 rsync -avz {} backup:/data/ ::: dir1 dir2 dir3 dir4 dir5
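
Per the manual, -j also accepts relative values, which adapt automatically to the machine:

parallel -j 200% gzip {} ::: *.log    # two jobs per CPU core
parallel -j+2 gzip {} ::: *.log       # number of cores plus two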

Use Cases

  • Batch image conversion (convert, ffmpeg)
  • Data processing pipelines (Python, R, awk)
  • Simulations and modeling
  • Remote backups and deployments
  • Unit test parallelization in CI systems

Advanced Features

1. Replacement Strings

You can refer to arguments from each ::: input source (or each --colsep column) with {1}, {2}, …, and {.} inserts the argument with its file extension removed. Example:

parallel convert {1}.png -resize {2}% {1}_{2}.png ::: img1 img2 ::: 50 75 100
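
Other common replacement strings include {/} (basename), {//} (dirname), {/.} (basename without extension), and {#} (job sequence number); for instance:

parallel echo 'job {#}: {/.}' ::: /data/a.csv /data/b.csv
# job 1: a
# job 2: b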

2. Grouping Output

By default (--group), parallel buffers each job's output and prints it as one unit, preventing interleaving; --ungroup prints output immediately and may mix lines from different jobs:

parallel --ungroup echo Processing {} ::: a b c
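
Two related flags are worth knowing: --tag prefixes each output line with the argument that produced it, and --line-buffer passes output through line by line without waiting for a job to finish (long_task.sh below is a stand-in for your own command):

parallel --tag echo done ::: a b c
parallel --line-buffer ./long_task.sh {} ::: a b c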

3. Job Log and Checkpointing

Keep track of job status and resume incomplete tasks:

parallel --joblog mylog.txt --resume mycommand.sh
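
A typical interrupt-and-resume workflow looks like this (process.sh is a stand-in for your own script):

# first run is interrupted partway through
parallel --joblog run.log process.sh {} ::: chunk_*
# second run skips jobs the log already marks as finished
parallel --joblog run.log --resume process.sh {} ::: chunk_*
# alternatively, re-run only the jobs that exited non-zero
parallel --joblog run.log --resume-failed process.sh {} ::: chunk_*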

4. Timeout and Retries

Define maximum execution time and retries on failure:

parallel --timeout 300 --retries 3 process_file {} ::: *.dat
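
--timeout also takes a relative form: per the manual, --timeout 200% kills any job that runs more than twice the median runtime of the jobs completed so far:

parallel --timeout 200% --retries 3 process_file {} ::: *.dat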

5. Remote Execution over SSH

Distribute jobs across multiple servers. parallel logs in over SSH, so set up non-interactive (key-based) authentication to each server first.

parallel --sshlogin server1,server2 -j 0 python analysis.py {} ::: sample1 sample2 sample3
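
If the input files exist only on the local machine, parallel can ship them to the workers and fetch results back: --trc {}.out is shorthand for --transferfile {} --return {}.out --cleanup (this sketch assumes analysis.py writes its result to <input>.out):

parallel --sshlogin server1,server2 --trc {}.out python analysis.py {} ::: sample1 sample2 sample3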

Performance Considerations

To maximize throughput:

  • Adjust -j to match CPU threads or I/O bandwidth.
  • Use --load to avoid overwhelming the system.
  • Group small tasks to reduce process-spawning overhead (see the sketch after this list).
  • Prefer in-memory operations when possible.
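
A sketch combining two of these knobs (gzip stands in for any small per-file task):

# start new jobs only while the load average is below 80% of the core count;
# -m appends many files to each gzip invocation, amortizing startup cost
parallel --load 80% -m gzip ::: *.txt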

Comparing GNU parallel and xargs

Feature               GNU parallel                     xargs
Parallel jobs         Yes, built-in                    Limited (-P flag)
Replacement strings   Extensive ({}, {1}, {#}, {.})    Basic (-I replace-str)
Job logging           Yes (--joblog)                   No

Best Practices

  • Quote your commands to avoid shell expansions.
  • Use --dry-run to verify command lines before execution (demonstrated after this list).
  • Implement timeouts and retries for unreliable tasks.
  • Keep logs for auditing and troubleshooting.
  • Test on small datasets before scaling up.
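
--dry-run in particular is cheap insurance: it prints the commands parallel would run without executing any of them:

parallel --dry-run gzip {} ::: a.txt b.txt
# gzip a.txt
# gzip b.txt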

Real-World Examples

Parallel Video Encoding

ls *.mov | parallel -j 8 ffmpeg -i {} -c:v libx265 -crf 28 {.}.mp4
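
Parsing ls output breaks on file names containing spaces; letting the shell glob directly is more robust:

parallel -j 8 ffmpeg -i {} -c:v libx265 -crf 28 {.}.mp4 ::: *.mov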

Distributed Data Aggregation

parallel --sshloginfile servers.txt -j 0 Rscript analyze.R {} ::: data_chunk*
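
servers.txt holds one SSH login per line, optionally prefixed with the number of cores to use on that host; the special login : means the local machine. The hostnames below are placeholders:

# servers.txt
8/server1.example.com
4/user@server2.example.com
: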

Tips and Tricks

  • Use --eta to display estimated completion times.
  • Use --bar for a progress bar in interactive sessions.
  • Propagate environment variables to jobs with --env: parallel --env VAR1 --env VAR2 … (see the example after this list).
  • Chain commands with shell grouping: parallel 'cd {} && make' ::: project1 project2.
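
For instance, a long batch with a progress bar and an explicitly propagated variable (THREADS is a made-up example variable):

export THREADS=4
parallel --bar --env THREADS 'cd {} && make -j $THREADS' ::: project1 project2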

Resources & Further Reading

  • Official manual: https://www.gnu.org/software/parallel/man.html
  • GNU parallel tutorial by Ole Tange: https://www.gnu.org/software/parallel/parallel_tutorial.html
  • Stack Overflow discussions: https://stackoverflow.com/questions/tagged/gnu-parallel

Conclusion

GNU parallel is a versatile and efficient way to accelerate shell-based tasks. From simple file conversions to massive distributed computations, it empowers you to harness your hardware—or cluster—effectively. By following best practices, profiling your workloads, and leveraging advanced features, you can streamline complex pipelines and achieve significant time savings.
