An example of using the Unix find command to quickly list, and count, the different file extensions in a directory tree.
If you find yourself working on an unfamiliar website or codebase, the Unix find command provides a quick and easy way to get a high-level view of what’s lurking in there.
In most cases the file types will be identified by their filename extension. If we want to find all the different extensions under the current directory, the following command will list them:-
# find . -type f -name "*.*" | sed 's/.*\.//' | sort -u
JPG
css
gif
htm
html
jpg
js
php
png
swf
txt
xap
xml
Here we’re using find to locate all plain files (-type f) under the current directory and its subdirectories whose names match *.* – i.e. there’s a period in the name, so it (probably) has an extension. Next we use sed to remove everything up to and including the final period, which leaves us with each file’s extension. Finally we use sort -u to reduce the list to the unique ones.
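To see the three stages working on something small, here is a minimal sketch that builds a throwaway directory tree and runs the same pipeline over it. The file names and the demo_dir and extensions variables are made up for illustration, and LC_ALL=C is only there so that the sort order (upper case first) is the same everywhere.

```shell
# Build a small throwaway tree (hypothetical file names) to try the pipeline on.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/img" "$demo_dir/js"
touch "$demo_dir/index.html" "$demo_dir/img/logo.png" \
      "$demo_dir/img/photo.JPG" "$demo_dir/js/app.min.js" "$demo_dir/README"

# Same three stages as above: find plain files with a period in the name,
# strip everything up to the last period, then de-duplicate.
extensions=$(cd "$demo_dir" && find . -type f -name "*.*" | sed 's/.*\.//' | LC_ALL=C sort -u)
printf '%s\n' "$extensions"

# Tidy up the scratch directory.
rm -rf "$demo_dir"
```

README has no period in its name so find skips it, and app.min.js is reported as js because only the text after the last period survives the sed stage.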
Useful though it is to note that our codebase here has both upper case and lower case versions of the same extensions, it might not be what we’re looking for. If we want to treat upper case and lower case as equivalent we can use the tr command to make them all lower-case first.
# find . -type f -name "*.*" | sed 's/.*\.//' | tr '[:upper:]' '[:lower:]' | sort -u
css
gif
htm
html
jpg
js
php
png
swf
txt
xap
xml
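The effect of the tr stage is easy to check in isolation. The sketch below feeds it a hypothetical, hand-written list of mixed-case extensions (the mixed and folded variable names are just for this example) rather than real find output.

```shell
# Hypothetical extension list with mixed case, one per line.
mixed=$(printf '%s\n' JPG jpg Png png css)

# Fold everything to lower case first, so that sort -u treats
# JPG and jpg as the same value.
folded=$(printf '%s\n' "$mixed" | tr '[:upper:]' '[:lower:]' | sort -u)
printf '%s\n' "$folded"
```

Five input lines collapse to three (css, jpg, png), because the case folding happens before the duplicates are removed.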
Finally it might be useful to see how many of each file type we have. We can do this by using uniq -c, which collapses repeated lines and prefixes each one with the number of times it occurred. It only collapses adjacent matches though, so we’ll still need to do a sort first, dropping the -u as uniq will do that bit for us:-
# find . -type f -name "*.*" | sed 's/.*\.//' | tr '[:upper:]' '[:lower:]' | sort | uniq -c
 151 css
 587 gif
 166 htm
  86 html
 257 jpg
 439 js
1585 php
1332 png
  54 swf
  29 txt
   6 xml
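If the most common file types are what we care about, one possible refinement (not part of the commands above) is to add a second sort with -rn, which orders the counted lines numerically, largest first. This sketch uses a hypothetical hand-written sample in place of real find output, and the sample and ranked names are only for illustration.

```shell
# Hypothetical sample of extensions, as they would arrive from the sed and tr stages.
sample=$(printf '%s\n' php png php css png php)

# sort groups identical lines together, uniq -c counts each group,
# and sort -rn then ranks the groups by that count, highest first.
ranked=$(printf '%s\n' "$sample" | sort | uniq -c | sort -rn)
printf '%s\n' "$ranked"
```

The width of the padding in front of each count differs between uniq implementations, so it is safer not to rely on it if the output is going to be parsed by another command.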
How can you handle files with multiple periods in the file name, and files with spaces? For example:
foo.bar.tgz
foo.2016.8.14.tgz
In both of these cases I am looking for something that only shows .tgz and ignores the other leading periods.
I’d expect the commands above to work okay in this scenario, as the .* in the sed expression is greedy and so matches and removes everything up to the last period (at least it works that way on my Mac!). If you’re seeing different behaviour, what sort of environment are you running in?
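A quick way to check this without touching any real files is to feed the names straight into the sed stage. The sketch below uses the two names from the question plus a hypothetical third one containing spaces. The names and result variables are made up for the example.

```shell
# The two file names from the question, plus a hypothetical one with spaces,
# written one per line the way find would print them.
names=$(printf '%s\n' './foo.bar.tgz' './foo.2016.8.14.tgz' './my backup file.tgz')

# The greedy .* swallows every earlier period, so only the text after the
# last one is left. Spaces are untouched as the pipeline works line by line.
result=$(printf '%s\n' "$names" | sed 's/.*\.//' | sort -u)
printf '%s\n' "$result"
```

All three names reduce to tgz. The one case this line-based approach would not cope with is a file name that itself contains a newline, since each part would be treated as a separate name.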