$ sort --version | head -n1
sort (GNU coreutils) 8.25
$ man sort
SORT(1) User Commands SORT(1)
NAME
sort - sort lines of text files
SYNOPSIS
sort [OPTION]... [FILE]...
sort [OPTION]... --files0-from=F
DESCRIPTION
Write sorted concatenation of all FILE(s) to standard output.
With no FILE, or when FILE is -, read standard input.
...

Note: All examples shown here assume ASCII encoded input files
$ cat poem.txt
Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.
$ sort poem.txt
And so are you.
Roses are red,
Sugar is sweet,
Violets are blue,

- Well, that was easy. The lines were sorted alphabetically (ascending order by default) and it so happened that the first letter alone was enough to decide the order
- For the next example, let's extract all the words and sort them
- also allows to showcase sort accepting stdin
- See the GNU grep chapter if the grep command used below looks alien
$ # output might differ depending on locale settings
$ # note the case-insensitiveness of output
$ grep -oi '[a-z]*' poem.txt | sort
And
are
are
are
blue
is
red
Roses
so
Sugar
sweet
Violets
you

- heed the locale caveat explained below
- See also the excerpt from info sort shown below
$ info sort | tail
(1) If you use a non-POSIX locale (e.g., by setting ‘LC_ALL’ to
‘en_US’), then ‘sort’ may produce output that is sorted differently than
you’re accustomed to. In that case, set the ‘LC_ALL’ environment
variable to ‘C’. Note that setting only ‘LC_COLLATE’ has two problems.
First, it is ineffective if ‘LC_ALL’ is also set. Second, it has
undefined behavior if ‘LC_CTYPE’ (or ‘LANG’, if ‘LC_CTYPE’ is unset) is
set to an incompatible value. For example, you get undefined behavior
if ‘LC_CTYPE’ is ‘ja_JP.PCK’ but ‘LC_COLLATE’ is ‘en_US.UTF-8’.

- Example to help show the effect of locale settings
$ # note how uppercase is sorted before lowercase
$ grep -oi '[a-z]*' poem.txt | LC_ALL=C sort
And
Roses
Sugar
Violets
are
are
are
blue
is
red
so
sweet
you

- This simply reverses the default ascending order to descending order
$ sort -r poem.txt
Violets are blue,
Sugar is sweet,
Roses are red,
And so are you.

$ cat numbers.txt
20
53
3
101
$ sort numbers.txt
101
20
3
53

- Whoops, what happened there?
- sort won't know to treat them as numbers unless specified
- Depending on the format of the numbers, different options have to be used
- First up is the -n option, which sorts based on numerical value
$ sort -n numbers.txt
3
20
53
101
$ sort -nr numbers.txt
101
53
20
3

- The -n option can handle negative numbers
- As well as thousands separators and decimal points (depends on locale)
- The <() syntax is process substitution - to put it simply, it allows the output of a command to be passed as an input file to another command without needing to manually create a temporary file (a quick illustration follows)
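As an aside, one way to peek at what <() expands to (a bash feature; the exact file descriptor number will vary):

$ # bash substitutes a /dev/fd/N path connected to the command's output
$ echo <(echo 'hi')
/dev/fd/63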
$ # multiple files are merged as single input by default
$ sort -n numbers.txt <(echo '-4')
-4
3
20
53
101
$ sort -n numbers.txt <(echo '1,234')
3
20
53
101
1,234
$ sort -n numbers.txt <(echo '31.24')
3
20
31.24
53
101

- Note that 1,234 ended up after 101 above, presumably because the locale in effect treats , as the thousands separator (so it compares as 1234)
- Use -g if input contains numbers prefixed by + or in E scientific notation
$ cat generic_numbers.txt
+120
-1.53
3.14e+4
42.1e-2
$ sort -g generic_numbers.txt
-1.53
42.1e-2
+120
3.14e+4

- Commands like du have options to display numbers in human readable formats
- sort supports sorting such numbers using the -h option
$ du -sh *
104K power.log
746M projects
316K report.log
20K sample.txt
$ du -sh * | sort -h
20K sample.txt
104K power.log
316K report.log
746M projects
$ # --si uses powers of 1000 instead of 1024
$ du -s --si *
107k power.log
782M projects
324k report.log
21k sample.txt
$ du -s --si * | sort -h
21k sample.txt
107k power.log
324k report.log
782M projects

- Version sort - dealing with numbers mixed with other characters
- If this sorting is needed simply while displaying directory contents, use ls -v instead of piping to sort -V (a quick illustration below)
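For instance, assuming a directory containing the four file names used later in this section (a hypothetical listing; the column layout depends on terminal width):

$ # hypothetical directory containing file0, file3, file4 and file10
$ ls -v
file0  file3  file4  file10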
$ cat versions.txt
foo_v1.2
bar_v2.1.3
foobar_v2
foo_v1.2.1
foo_v1.3
$ sort -V versions.txt
bar_v2.1.3
foobar_v2
foo_v1.2
foo_v1.2.1
foo_v1.3

- Another common use case is when there are multiple filenames differentiated by numbers
$ cat files.txt
file0
file10
file3
file4
$ sort -V files.txt
file0
file3
file4
file10

- Can be used when dealing with numbers reported by the time command as well
$ # different solving durations
$ cat rubik_time.txt
5m35.363s
3m20.058s
4m5.099s
4m1.130s
3m42.833s
4m33.083s
$ # assuming consistent min/sec format
$ sort -V rubik_time.txt
3m20.058s
3m42.833s
4m1.130s
4m5.099s
4m33.083s
5m35.363s

- Random sort with the -R option - note that duplicate lines will always end up next to each other
- might be useful as a feature for some cases ;)
- Use shuf if this is not desirable
- See also: How can I shuffle the lines of a text file on the Unix command line or in a shell script?
$ cat nums.txt
1
10
10
12
23
563
$ # the two 10s will always be next to each other
$ sort -R nums.txt
563
12
1
10
10
23
$ # duplicates can end up anywhere
$ shuf nums.txt
10
23
1
10
563
12

- The -o option can be used to specify the output file
- Useful for in-place editing, as the caution below explains
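Why does -o matter for in-place editing? Because plain shell redirection truncates the target file before sort gets a chance to read it (the unsafe command is shown only as a comment):

$ # DON'T: > truncates nums.txt to empty before sort opens it
$ # sort nums.txt > nums.txt
$ # safe: -o writes output only after all input has been read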
$ sort -R nums.txt -o rand_nums.txt
$ cat rand_nums.txt
23
1
10
10
563
12
$ sort -R nums.txt -o nums.txt
$ cat nums.txt
563
23
10
10
1
12

- Use shell script looping if there are multiple files to be sorted in place
- Below snippet is for the bash shell
$ for f in *.txt; do echo sort -V "$f" -o "$f"; done
sort -V files.txt -o files.txt
sort -V rubik_time.txt -o rubik_time.txt
sort -V versions.txt -o versions.txt
$ # remove echo once commands look fine
$ for f in *.txt; do sort -V "$f" -o "$f"; done

- The -u option keeps only the first copy of lines that are deemed to be the same according to the sort option(s) used
$ cat duplicates.txt
foo
12 carrots
foo
12 apples
5 guavas
$ # only one copy of foo in output
$ sort -u duplicates.txt
12 apples
12 carrots
5 guavas
foo

- According to the option used, the definition of duplicate will vary
- For example, when -n is used, matching numbers are deemed the same even if the rest of the line differs
- Pipe the output to uniq if this is not desirable
$ # note how first copy of line starting with 12 is retained
$ sort -nu duplicates.txt
foo
5 guavas
12 carrots
$ # use uniq when entire line should be compared to find duplicates
$ sort -n duplicates.txt | uniq
foo
5 guavas
12 apples
12 carrots

- Use the -f option to ignore case of alphabets while determining duplicates
$ cat words.txt
CAR
are
car
Are
foot
are
$ # only the two 'are' were considered duplicates
$ sort -u words.txt
are
Are
car
CAR
foot
$ # note again that first copy of duplicate is retained
$ sort -fu words.txt
are
CAR
foot

From info sort
‘-k POS1[,POS2]’
‘--key=POS1[,POS2]’
Specify a sort field that consists of the part of the line between
POS1 and POS2 (or the end of the line, if POS2 is omitted),
_inclusive_.
Each POS has the form ‘F[.C][OPTS]’, where F is the number of the
field to use, and C is the number of the first character from the
beginning of the field. Fields and character positions are
numbered starting with 1; a character position of zero in POS2
indicates the field’s last character. If ‘.C’ is omitted from
POS1, it defaults to 1 (the beginning of the field); if omitted
from POS2, it defaults to 0 (the end of the field). OPTS are
ordering options, allowing individual keys to be sorted according
to different rules; see below for details. Keys can span multiple
fields.
- By default, blank characters (space and tab) serve as field separators
$ cat fruits.txt
apple 42
guava 6
fig 90
banana 31
$ sort fruits.txt
apple 42
banana 31
fig 90
guava 6
$ # sort based on 2nd column numbers
$ sort -k2,2n fruits.txt
guava 6
banana 31
apple 42
fig 90

- Using a different field separator
- Consider the following sample input file having fields separated by :
$ # name:pet_name:no_of_pets
$ cat pets.txt
foo:dog:2
xyz:cat:1
baz:parrot:5
abcd:cat:3
joe:dog:1
bar:fox:1
temp_var:squirrel:4
boss:dog:10

- Sorting based on a particular column, or from a column to the end of line
- In case of ties, by default sort would use the contents of the remaining parts of the line to resolve them
$ # only 2nd column
$ # -k2,4 would mean 2nd column to 4th column
$ sort -t: -k2,2 pets.txt
abcd:cat:3
xyz:cat:1
boss:dog:10
foo:dog:2
joe:dog:1
bar:fox:1
baz:parrot:5
temp_var:squirrel:4
$ # from 2nd column to end of line
$ sort -t: -k2 pets.txt
xyz:cat:1
abcd:cat:3
joe:dog:1
boss:dog:10
foo:dog:2
bar:fox:1
baz:parrot:5
temp_var:squirrel:4

- Multiple keys can be specified to resolve ties
- Note that if there are still multiple entries with the specified keys, the remaining parts of the lines would be used
$ # default sort for 2nd column, numeric sort on 3rd column to resolve ties
$ sort -t: -k2,2 -k3,3n pets.txt
xyz:cat:1
abcd:cat:3
joe:dog:1
foo:dog:2
boss:dog:10
bar:fox:1
baz:parrot:5
temp_var:squirrel:4
$ # numeric sort on 3rd column, default sort for 2nd column to resolve ties
$ sort -t: -k3,3n -k2,2 pets.txt
xyz:cat:1
joe:dog:1
bar:fox:1
foo:dog:2
abcd:cat:3
temp_var:squirrel:4
baz:parrot:5
boss:dog:10

- Use the -s option to retain the original order of lines in case of a tie
$ sort -s -t: -k2,2 pets.txt
xyz:cat:1
abcd:cat:3
foo:dog:2
joe:dog:1
boss:dog:10
bar:fox:1
baz:parrot:5
temp_var:squirrel:4

- The -u option, as seen earlier, will retain only the first match
$ sort -u -t: -k2,2 pets.txt
xyz:cat:1
foo:dog:2
bar:fox:1
baz:parrot:5
temp_var:squirrel:4
$ sort -u -t: -k3,3n pets.txt
xyz:cat:1
foo:dog:2
abcd:cat:3
temp_var:squirrel:4
baz:parrot:5
boss:dog:10

- Sometimes, the input has to be sorted first and then -u used on the sorted output
- See also: remove duplicates based on the value of another column
$ # sort by number in 3rd column
$ sort -t: -k3,3n pets.txt
bar:fox:1
joe:dog:1
xyz:cat:1
foo:dog:2
abcd:cat:3
temp_var:squirrel:4
baz:parrot:5
boss:dog:10
$ # then get unique entry based on 2nd column
$ sort -t: -k3,3n pets.txt | sort -t: -u -k2,2
xyz:cat:1
joe:dog:1
bar:fox:1
baz:parrot:5
temp_var:squirrel:4

- Specifying particular characters within fields
- If the character position is not specified, it defaults to 1 for the starting position and 0 (last character) for the ending position
$ cat marks.txt
fork,ap_12,54
flat,up_342,1.2
fold,tn_48,211
more,ap_93,7
rest,up_5,63
$ # for 2nd column, sort numerically only from 4th character to end
$ sort -t, -k2.4,2n marks.txt
rest,up_5,63
fork,ap_12,54
fold,tn_48,211
more,ap_93,7
flat,up_342,1.2
$ # sort uniquely based on first two characters of line
$ sort -u -k1.1,1.2 marks.txt
flat,up_342,1.2
fork,ap_12,54
more,ap_93,7
rest,up_5,63

- If there are headers
$ cat header.txt
fruit qty
apple 42
guava 6
fig 90
banana 31
$ # separate and combine header and content to be sorted
$ cat <(head -n1 header.txt) <(tail -n +2 header.txt | sort -k2nr)
fruit qty
fig 90
apple 42
banana 31
guava 6
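The same result can also be had with command grouping instead of process substitution (a sketch; both approaches read header.txt twice):

$ { head -n1 header.txt; tail -n +2 header.txt | sort -k2nr; }
fruit qty
fig 90
apple 42
banana 31
guava 6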
- There are many other options apart from the handful presented above. See man sort and info sort for detailed documentation and more examples (one more, -c, is sketched after this list)
- sort like a master
- When -b to ignore leading blanks is needed
- sort Q&A on unix stackexchange
- sort on multiple columns using -k option
- sort a string character wise
- Scalability of 'sort -u' for gigantic files
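One option from that long list worth singling out is -c, which checks whether input is already sorted instead of sorting it. A small sketch using numbers.txt from earlier (the exact diagnostic wording may differ between coreutils versions):

$ # numbers.txt is not sorted, so -c reports the first out-of-order line
$ sort -c numbers.txt
sort: numbers.txt:3: disorder: 3
$ # exit status is 0 when input is already sorted per the given options
$ sort -n numbers.txt | sort -nc
$ echo $?
0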
$ uniq --version | head -n1
uniq (GNU coreutils) 8.25
$ man uniq
UNIQ(1) User Commands UNIQ(1)
NAME
uniq - report or omit repeated lines
SYNOPSIS
uniq [OPTION]... [INPUT [OUTPUT]]
DESCRIPTION
Filter adjacent matching lines from INPUT (or standard input), writing
to OUTPUT (or standard output).
With no options, matching lines are merged to the first occurrence.
...

$ cat word_list.txt
are
are
to
good
bad
bad
bad
good
are
bad
$ # adjacent duplicate lines are removed, leaving one copy
$ uniq word_list.txt
are
to
good
bad
good
are
bad
$ # To remove duplicates from entire file, input has to be sorted first
$ # also showcases that uniq accepts stdin as input
$ sort word_list.txt | uniq
are
bad
good
to

$ # duplicates adjacent to each other
$ uniq -d word_list.txt
are
bad
$ # duplicates in entire file
$ sort word_list.txt | uniq -d
are
bad
good

- To get only the duplicates and also show all their copies, use the -D option
$ uniq -D word_list.txt
are
are
bad
bad
bad
$ sort word_list.txt | uniq -D
are
are
are
bad
bad
bad
bad
good
good

- To distinguish the different groups, use --all-repeated=separate
$ # using --all-repeated=prepend will add a newline before the first group as well
$ sort word_list.txt | uniq --all-repeated=separate
are
are
are
bad
bad
bad
bad
good
good

$ # lines with no adjacent duplicates
$ uniq -u word_list.txt
to
good
good
are
bad
$ # unique lines in entire file
$ sort word_list.txt | uniq -u
to

- Count the number of occurrences using the -c option

$ # adjacent lines
$ uniq -c word_list.txt
2 are
1 to
1 good
3 bad
1 good
1 are
1 bad
$ # entire file
$ sort word_list.txt | uniq -c
3 are
4 bad
2 good
1 to
$ # entire file, only duplicates
$ sort word_list.txt | uniq -cd
3 are
4 bad
2 good

- Sorting by count
$ # sort by count
$ sort word_list.txt | uniq -c | sort -n
1 to
2 good
3 are
4 bad
$ # reverse the order, highest count first
$ sort word_list.txt | uniq -c | sort -nr
4 bad
3 are
2 good
1 to

- To get only entries with the min/max count, a bit of awk magic would help
$ # consider this result
$ sort colors.txt | uniq -c | sort -nr
3 Red
3 Blue
2 Yellow
1 Green
1 Black
$ # to get all max count
$ # save 1st line 1st column value to c and then print if 1st column equals c
$ sort colors.txt | uniq -c | sort -nr | awk 'NR==1{c=$1} $1==c'
3 Red
3 Blue
$ # to get all min count
$ sort colors.txt | uniq -c | sort -n | awk 'NR==1{c=$1} $1==c'
1 Black
1 Green

- Get a rough count of most used commands from the history file
$ # awk '{print $1}' will get the 1st column alone
$ awk '{print $1}' "$HISTFILE" | sort | uniq -c | sort -nr | head
1465 echo
1180 grep
552 cd
531 awk
451 sed
423 vi
418 cat
392 perl
325 printf
320 sort
$ # extract command name from start of line or preceded by 'spaces|spaces'
$ # won't catch commands in other places like command substitution though
$ grep -oP '(^| +\| +)\K[^ ]+' "$HISTFILE" | sort | uniq -c | sort -nr | head
2006 grep
1469 echo
933 sed
698 awk
552 cd
513 perl
510 cat
453 sort
423 vi
327 printf

- Use the -i option to ignore case while determining duplicates

$ cat another_list.txt
food
Food
good
are
bad
Are
$ # note how first copy is retained
$ uniq -i another_list.txt
food
good
are
bad
Are
$ uniq -iD another_list.txt
food
Food

$ sort -f word_list.txt another_list.txt | uniq -i
are
bad
food
good
to
$ sort -f word_list.txt another_list.txt | uniq -c
4 are
1 Are
5 bad
1 food
1 Food
3 good
1 to
$ sort -f word_list.txt another_list.txt | uniq -ic
5 are
5 bad
2 food
3 good
1 to

- If only adjacent lines (not sorted) are required, the files need to be concatenated using another command
$ uniq -id word_list.txt
are
bad
$ uniq -id another_list.txt
food
$ cat word_list.txt another_list.txt | uniq -id
are
bad
food

- uniq has a few options dealing with column manipulations. Not as extensive as sort -k, but handy for some cases
- First up, skipping fields
- There is no option to specify a different delimiter (a workaround is sketched after the next example)
- From info uniq: Fields are sequences of non-space non-tab characters that are separated from each other by at least one space or tab
- Number of spaces/tabs between fields should be the same
$ cat shopping.txt
lemon 5
mango 5
banana 8
bread 1
orange 5
$ # skips first field
$ uniq -f1 shopping.txt
lemon 5
banana 8
bread 1
orange 5
$ # use -f3 to skip first three fields and so on
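As for the missing delimiter option, one workaround (a sketch, using a hypothetical colon-delimited input) is to translate the delimiter to a space and back:

$ printf 'lemon:5\nmango:5\nbanana:8\n' | tr ':' ' ' | uniq -f1 | tr ' ' ':'
lemon:5
banana:8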
- Skipping characters

$ cat text
glue
blue
black
stack
stuck
$ # don't consider first 2 characters
$ uniq -s2 text
glue
black
stuck
$ # to visualize the above example
$ # assume there are two fields and uniq is applied on 2nd column
$ sed 's/^../& /' text
gl ue
bl ue
bl ack
st ack
st uck

- Up to a specified number of characters
$ # consider only first 2 characters
$ uniq -w2 text
glue
blue
stack
$ # to visualize the above example
$ # assume there are two fields and uniq is applied on 1st column
$ sed 's/^../& /' text
gl ue
bl ue
bl ack
st ack
st uck

- Combining -s and -w
- Can be combined with -f as well
$ # skip first 3 characters and then use next 2 characters
$ uniq -s3 -w2 text
glue
black

- Do check out man uniq and info uniq for other options and more detailed documentation
- uniq Q&A on unix stackexchange
- process duplicate lines only based on certain fields
$ comm --version | head -n1
comm (GNU coreutils) 8.25
$ man comm
COMM(1) User Commands COMM(1)
NAME
comm - compare two sorted files line by line
SYNOPSIS
comm [OPTION]... FILE1 FILE2
DESCRIPTION
Compare sorted files FILE1 and FILE2 line by line.
When FILE1 or FILE2 (not both) is -, read standard input.
With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2, and
column three contains lines common to both files.
...

Consider the below sample input files
$ # sorted input files viewed side by side
$ paste colors_1.txt colors_2.txt
Blue	Black
Brown	Blue
Purple	Green
Red	Red
Teal	White
Yellow

- Without any option, comm gives 3 column output (the columns are separated by tabs):
- lines unique to the first file
- lines unique to the second file
- lines common to both files
$ comm colors_1.txt colors_2.txt
	Black
		Blue
Brown
	Green
Purple
		Red
Teal
	White
Yellow
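Since the column separator is a tab, the alignment can be hard to read. GNU comm provides --output-delimiter to make the columns explicit (a sketch; each leading delimiter stands for a skipped column):

$ comm --output-delimiter='|' colors_1.txt colors_2.txt
|Black
||Blue
Brown
|Green
Purple
||Red
Teal
|White
Yellow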
- -1 suppresses lines unique to the first file
- -2 suppresses lines unique to the second file
- -3 suppresses lines common to both files

$ # suppressing column 3
$ comm -3 colors_1.txt colors_2.txt
	Black
Brown
	Green
Purple
Teal
	White
Yellow

- Combining options gives three distinct and useful constructs
- First, getting only common lines to both files
$ comm -12 colors_1.txt colors_2.txt
Blue
Red

- Second, lines unique to the first file
$ comm -23 colors_1.txt colors_2.txt
Brown
Purple
Teal
Yellow

- And the third, lines unique to the second file
$ comm -13 colors_1.txt colors_2.txt
Black
Green
White

- See also how the above three cases can be done using grep alone
- Note that input files do not need to be sorted for the grep solution
- If a different sort order than the default is required, use --nocheck-order to ignore the error message
$ comm -23 <(sort -n numbers.txt) <(sort -n nums.txt)
3
comm: file 1 is not in sorted order
20
53
101
$ comm --nocheck-order -23 <(sort -n numbers.txt) <(sort -n nums.txt)
3
20
53
101

- As many duplicate lines as match in both files will be considered common
- The rest will be unique to the respective files
- This is useful for cases like finding lines present in the first file but not in the second, taking the count of duplicates into consideration as well
- This solution won't be possible with grep
$ paste list1 list2
a	a
a	b
a	c
b	c
b	d
c
$ comm list1 list2
		a
a
a
		b
b
		c
	c
	d
$ comm -23 list1 list2
a
a
b

- man comm and info comm for more options and detailed documentation
- comm Q&A on unix stackexchange
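A closing tip: when the inputs aren't already sorted, process substitution (covered in the sort section) lets comm sort them on the fly. A generic sketch with hypothetical file names:

$ # lines common to two unsorted files
$ comm -12 <(sort file1) <(sort file2)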
$ shuf --version | head -n1
shuf (GNU coreutils) 8.25
$ man shuf
SHUF(1) User Commands SHUF(1)
NAME
shuf - generate random permutations
SYNOPSIS
shuf [OPTION]... [FILE]
shuf -e [OPTION]... [ARG]...
shuf -i LO-HI [OPTION]...
DESCRIPTION
Write a random permutation of the input lines to standard output.
With no FILE, or when FILE is -, read standard input.
...

- Without repeating input lines
$ cat nums.txt
1
10
10
12
23
563
$ # duplicates can end up anywhere
$ # all lines are part of output
$ shuf nums.txt
10
23
1
10
563
12
$ # limit max number of output lines
$ shuf -n2 nums.txt
563
23

- Use the -o option to specify an output file name instead of displaying on stdout
- Helpful for in-place editing
$ shuf nums.txt -o nums.txt
$ cat nums.txt
10
12
23
10
563
1

- With repeated input lines
$ # -n3 for max 3 lines, -r allows input lines to be repeated
$ shuf -n3 -r nums.txt
1
1
563
$ seq 3 | shuf -n5 -r
2
1
2
1
2
$ # if a limit using -n is not specified, shuf will output lines indefinitely
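$ # for example, the endless repeating stream can be capped with head
$ # (a sketch - the actual values will vary)
$ seq 3 | shuf -r | head -n4
2
2
1
3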
- Use the -e option to specify multiple input lines from the command line itself
$ shuf -e red blue green
green
blue
red
$ shuf -e 'hi there' 'hello world' foo bar
bar
hi there
foo
hello world
$ shuf -n2 -e 'hi there' 'hello world' foo bar
foo
hi there
$ shuf -r -n4 -e foo bar
foo
foo
bar
foo

- The -i option accepts an integer range as input to be shuffled
$ shuf -i 3-8
3
7
6
4
8
5

- Combine with other options as needed
$ shuf -n3 -i 3-8
5
4
7
$ shuf -r -n4 -i 3-8
5
5
7
8
$ shuf -r -n5 -i 0-1
1
0
0
1
1

- Use seq input if negative numbers, floating point, etc. are needed
$ seq 2 -1 -2 | shuf
2
-1
-2
0
1
$ seq 0.3 0.1 0.7 | shuf -n3
0.4
0.5
0.7

- man shuf and info shuf for more options and detailed documentation
- Generate random numbers in specific range
- Variable - randomly choose among three numbers
- Related to 'random' stuff: