Parallel¶
1. Introduction¶
There are many common tasks in Linux that we may want to consider running in parallel, such as:
- Downloading a large number of files
- Encoding/decoding a large number of images on a machine with multiple CPU cores
- Making a computation with many different parameters and storing the results
Of course, we can accomplish all these tasks without using parallelization. But if we process each file, connection, or computation in several parallel processes, we can gain a great advantage in terms of speed. Luckily, there are multiple powerful command-line tools for parallelization in Linux systems that can help us achieve this.
In this tutorial, we're going to see how to use the Bash ampersand & operator, xargs, and GNU parallel to achieve parallelization on the Linux command line.
2. A Sample Task¶
First, let's create a simple script that we'll run in parallel.
Let's create a file named ./process with contents:
This script will fake an actual process that takes 2 to 5 seconds to complete. Let's make it executable to be able to use it:
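The listing for ./process isn't reproduced here, so the following is a minimal sketch that matches the description. The message wording and the secs variable are our own assumptions; only the 2-to-5-second pause comes from the text. The snippet writes the script with a here-document and then marks it executable:

```shell
# Write a stand-in ./process script (its contents are an assumption).
cat > ./process <<'EOF'
#!/bin/bash
# Pretend to work on the item passed as the first argument.
secs=$(( RANDOM % 4 + 2 ))   # random whole number from 2 to 5
echo "Processing $1 for ${secs}s"
sleep "$secs"
echo "Finished $1"
EOF

# Allow the file to be run directly as ./process.
chmod +x ./process
```

Running ./process 1 should then print a start message, pause for a few seconds, and print a finish message.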
3. Using &¶
As a basic way to run commands in parallel, we can use the built-in Bash ampersand & operator to run a command asynchronously so that the shell doesn't wait for the current command to complete before moving on to the next one:
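As a rough sketch of the idea, the snippet below launches two background jobs. The task function is a hypothetical stand-in for ./process (with a fixed one-second pause) so that the example runs on its own:

```shell
# Hypothetical stand-in for ./process: report, pause, report again.
task() {
  echo "start $1"
  sleep 1
  echo "done $1"
}

task 1 &   # '&' sends the job to the background and returns at once
task 2 &   # so the second job starts without waiting for the first
echo "launched two background jobs; last PID was $!"
```

The order of the printed lines isn't fixed, because both jobs and the shell itself run concurrently.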
This will create two processes that will start at essentially the same instant and run in parallel. Because we've introduced random sleep times in our example script, the output may look like this:
Clearly, we can use this approach to run many parallel processes. But if we have many tasks (for example, a hundred images to be converted), we wouldn't want to start all hundred tasks at once. Instead, we'd process them in batches to utilize our cores better. To achieve this, we need to wait for some tasks to complete before starting others.
3.1. Using wait with &¶
The wait command will, by default, wait for all child processes to exit. So, using the wait command, we can run batches of operations:
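A small sketch of batching with wait, under the assumption of six tasks split into two batches of three. The task function is again a hypothetical stand-in for ./process, and the timing code is there only to make the effect visible:

```shell
# Hypothetical stand-in for ./process with a fixed one-second pause.
task() {
  sleep 1
  echo "done $1"
}

start=$(date +%s)

# First batch: three jobs run side by side.
task 1 & task 2 & task 3 &
wait   # returns once every background job above has exited

# Second batch: starts only after the whole first batch is finished.
task 4 & task 5 & task 6 &
wait

elapsed=$(( $(date +%s) - start ))
echo "six one-second jobs took about ${elapsed}s"
```

With two batches of three, the total comes to roughly two seconds rather than the six a sequential run would need.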
However, there's one big downside to this approach. To utilize our CPU cores effectively, we'd want a new process to start as soon as a running process ends. But with this solution, we wouldn't start new processes until all the tasks in the previous batch were completed. To overcome this limitation, we can use xargs.
4. Using xargs¶
xargs is a command-line tool that helps us run commands with arguments parsed from standard input. It can also parallelize our tasks for us.
Let's try the previous input we used with &, but this time with xargs:
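The command itself isn't shown here, so the following is only a sketch of what such a call might look like. Here echo stands in for ./process so that the snippet runs anywhere, and the values 1 to 6 and the limit of three parallel jobs are arbitrary choices:

```shell
# One argument per line on standard input.
#   -n 1   hand exactly one argument to each invocation
#   -P 3   keep at most three invocations running at the same time
out=$(printf '%s\n' 1 2 3 4 5 6 | xargs -n 1 -P 3 echo "processing")

# Lines appear in completion order, which can differ between runs.
printf '%s\n' "$out"
```

Swapping echo "processing" for ./process makes the staggered finish times visible.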
xargs immediately creates the next process once a process is completed. We specify the number of arguments per call using the -n argument and the number of parallel tasks using the -P argument.
4.1. Using Replacement¶
If the executable we're using requires the arguments in a specific place rather than appended directly after the executable name, we can use replacement with the -I option.
Let's try it:
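One possible form, assuming the -I option with {} as the placeholder. The original command isn't shown, and the message text is made up:

```shell
# -I {} makes xargs run the command once per input line and
# substitute that line wherever {} appears in the arguments.
out=$(printf '%s\n' 1 2 3 | xargs -I {} -P 2 echo "item {} is done")
printf '%s\n' "$out"
```

Because the placeholder can sit anywhere in the command, the argument no longer has to come last.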
4.2. Handling Arguments With Newlines¶
If the arguments we want to use with our processes include newline characters, we can use a null character (\0) delimited input stream. For example, with the find command, we can set the output to be null-delimited instead of newline-delimited by using the -print0 flag, and have xargs read it with the -0 flag:
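The sketch below pairs -print0 with the -0 flag of xargs, which tells xargs to split its input on null characters. The scratch directory and the file names are invented for the demonstration:

```shell
# Scratch directory with one ordinary file and one whose name contains a newline.
dir=$(mktemp -d)
touch "$dir/plain.txt" "$dir/two
lines.txt"

# find ends each path with a NUL byte (-print0); xargs -0 splits on NUL only,
# so the embedded newline stays inside a single argument.
count=$(find "$dir" -type f -print0 | xargs -0 -n 1 -P 2 sh -c 'echo handled' _ | wc -l)
echo "invocations: $count"

rm -rf "$dir"
```

Without -print0 and -0, the second file name would be split into two separate arguments.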
As the arguments are now null-delimited, we can be sure that newline characters in the input will be preserved.
5. Using GNU parallel¶
GNU parallel is one of the most advanced command-line tools available for running parallel tasks. It has many features, including the ability to distribute and run tasks remotely on multiple machines using ssh.
5.1. Basic Usage¶
The basic usage of parallel is very similar to xargs. Actually, for simple cases, we can use it interchangeably with xargs.
Let's try:
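Since the command isn't reproduced here, this is a guess at its general shape. GNU parallel must be installed; echo again stands in for ./process, and the inputs and the job limit are arbitrary:

```shell
# Arguments arrive on standard input, one per line, just as with xargs.
#   --jobs 2    run at most two jobs at a time
#   --ungroup   pass output through immediately instead of holding it per job
out=$(printf '%s\n' 1 2 3 4 | parallel --jobs 2 --ungroup echo "processing")
printf '%s\n' "$out"
```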
The --jobs argument is the same as the xargs command's -P argument, which determines the maximum number of parallel jobs to be running at the same time.
By default, parallel will print the output of a process only after it is finished. The --ungroup flag disables this functionality. We can use it to see the actual execution order of commands as they are running.
We can also supply the input arguments via the command line. Let's try running it to get the same output as above:
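A sketch of the command-line form, with the same stand-in command and arbitrary inputs as assumptions:

```shell
# Everything after ::: is treated as the list of input arguments,
# so nothing needs to be piped in on standard input.
out=$(parallel --jobs 2 echo "processing" ::: 1 2 3 4)
printf '%s\n' "$out"
```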
When supplying command-line arguments, we can use ::: (three colons) to supply arguments directly, and :::: (four colons) to supply arguments from a file.
Let's see an example that supplies input from a file:
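For illustration, the snippet below writes a throwaway argument file and reads it back with ::::. The file is created by mktemp, and its contents are arbitrary:

```shell
# Put one argument per line into a temporary file.
args_file=$(mktemp)
printf '%s\n' 1 2 3 4 > "$args_file"

# :::: reads the arguments from the named file instead of the command line.
out=$(parallel --jobs 2 echo "processing" :::: "$args_file")
printf '%s\n' "$out"

rm -f "$args_file"
```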
The output would be similar to the above.
5.2. Running Combinations of Multiple Sources¶
We can use parallel to run tasks for every possible combination of two sources.
Let's try it for two sample sources:
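For instance, with two made-up sources, A B and 1 2, each introduced by its own :::, parallel runs the command once per pair:

```shell
# Two ::: groups produce the cross product: A 1, A 2, B 1 and B 2.
# Completion order is not guaranteed, so the lines may be shuffled.
out=$(parallel echo ::: A B ::: 1 2)
printf '%s\n' "$out"
```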
5.3. Linking Sources¶
If, instead of running for every possible combination, we want to "link" the sources one after another, we can use the --link flag. Let's try it with two different input sources:
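A sketch with two sample sources of equal length (the values are arbitrary):

```shell
# --link pairs the first items of each source, then the second, and so on,
# giving A 1, B 2 and C 3 instead of all nine combinations.
out=$(parallel --link echo ::: A B C ::: 1 2 3)
printf '%s\n' "$out"
```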
5.4. Replacement Strings¶
Like in xargs, we can use replacement strings in parallel. The default replacement string is {}.
Let's try it with a prefix:
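A minimal example, where the item- prefix is just an illustrative choice:

```shell
# {} is replaced by the current argument; the text around it is kept as is.
out=$(parallel echo "item-{}" ::: 1 2 3)
printf '%s\n' "$out"
```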
Other replacement strings do different kinds of manipulations on the input. For example, {.} will remove the extension from the argument:
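As a quick illustration with two hypothetical file names:

```shell
# {.} is the argument with its last extension removed,
# so photo.jpg becomes photo and notes.txt becomes notes.
out=$(parallel echo "{} becomes {.}" ::: photo.jpg notes.txt)
printf '%s\n' "$out"
```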
If we want to use multiple different variables for each command, we can also do this using special replacement strings:
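For example, positional replacement strings let us reorder values from two sources (the sources here are arbitrary):

```shell
# {1} refers to the first input source and {2} to the second,
# so the values can be placed in any order within the command.
out=$(parallel echo "{2}-{1}" ::: A B ::: 1 2)
printf '%s\n' "$out"
```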
There are also many more options for replacement strings that can be found in the parallel tutorial.
5.5. Reading Input From File Columns¶
We can read the input from different columns of a text file. Let's try it with a tab-separated text file:
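A sketch using the --colsep option, with a small two-column file invented for the purpose:

```shell
# Hypothetical two-column, tab-separated input file.
tsv=$(mktemp)
printf 'A\t1\nB\t2\n' > "$tsv"

# --colsep splits each input line into columns, addressed as {1}, {2}, ...
out=$(parallel --colsep '\t' echo "first={1} second={2}" :::: "$tsv")
printf '%s\n' "$out"

rm -f "$tsv"
```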
5.6. Saving Output¶
We can save the output of each process into a file by using the --files flag:
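A rough sketch follows; the exact location and naming of the output files may vary between versions of parallel, and the loop that reads them back is our own addition:

```shell
# --files stores each job's output in a temporary file and prints the file name.
files=$(parallel --files echo ::: A B C)

# Show what each job wrote, then clean up.
# (Relies on the temporary paths containing no spaces.)
n=0
for f in $files; do
  printf '%s: %s\n' "$f" "$(cat "$f")"
  rm -f "$f"
  n=$(( n + 1 ))
done
echo "output files: $n"
```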
This will create *.par files with the output of our commands as the content.
If we want to have a more friendly directory structure, we can use the --results and --header arguments to write the results to a folder in a hierarchy.
Let's run a command to generate the directory tree:
Now, let's check the output using the tree command:
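Neither listing is shown here, so this sketch covers both steps under our own assumptions: the source names f1 and f2, their values, and the scratch directory are all illustrative. It uses find for the listing so that it works where tree isn't installed:

```shell
# Collect results under a scratch directory.
scratch=$(mktemp -d)
outdir="$scratch/outdir"

# --header : takes the first value of each source (f1, f2) as its name;
# --results stores each job's stdout and stderr under outdir/<name>/<value>/...
parallel --header : --results "$outdir" echo ::: f1 A B ::: f2 C D > /dev/null

# List the generated files; 'tree "$outdir"' gives a nicer view if installed.
find "$outdir" -type f | sort
```

The job that ran with f1=A and f2=C, for example, leaves its output in outdir/f1/A/f2/C/stdout.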
parallel generates the directory structure based on argument positions and values.
5.7. Progress Information¶
We can also have parallel show an estimate of the remaining time based on current task runs:
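One option for this is the --eta flag. In the sketch below, the sleep durations are arbitrary and only give the estimate something to measure:

```shell
# --eta prints a running estimate of the remaining time while jobs execute.
# Four one-second jobs, two at a time, take about two seconds in total.
out=$(parallel --jobs 2 --eta 'sleep {}; echo "slept {}s"' ::: 1 1 1 1)
printf '%s\n' "$out"
```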
5.8. Running Parallel Tasks on Remote Machines¶
We can run our parallel tasks on remote machines using parallel through ssh.
Let's assume we have access to host1 and host2 using our username and ssh keys that are added to our system. Let's try it:
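As a sketch only: host1 and host2 are the placeholder names from the text, and key-based ssh login to both is assumed. The command is wrapped in a hypothetical function, run_on_hosts, so that nothing tries to connect until we call it:

```shell
# --sshlogin (short form: -S) takes a comma-separated list of remote hosts.
# The single quotes keep $(hostname) from expanding locally, so each job
# reports the machine it actually ran on.
run_on_hosts() {
  parallel --sshlogin host1,host2 'echo "{} ran on $(hostname)"' ::: "$@"
}

# Example call, once both hosts are reachable:
#   run_on_hosts 1 2 3 4
```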
The hosts that will run each command and the order of execution will change randomly with every run.
6. Conclusion¶
In this article, we learned how to use the Bash ampersand & operator, xargs, and GNU parallel to parallelize our tasks on the command line.