Unix – One-liners

General

Manipulating all files with a given extension

# power chassis: generic template
for f in directory/*.ext ; do n=$(basename "$f"); fn=${n%.ext}; mycodehere > "outdir/${fn}.newext" ; done
 
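# for example, quality-filter all BAM files in a directory
# -f 0x002: keep proper pairs; -F 0x00C (0x004 + 0x008): drop reads that are unmapped or have an unmapped mate
for f in bam-uf/*.bam ; do n=$(basename "$f"); fn=${n%.bam}; samtools view -b -q 20 -f 0x002 -F 0x00C "$f" > "bam-mq20/${fn}.q20.bam" ; done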

Remove empty lines

cat tocleanup.txt | perl -pe 's/^\s+$//' | sed '/^$/d' > cleaned.txt

Get a sequence from 0 to 100 in steps of 10

for i in `seq 0 10 100`; do echo $i; done

Rename all files with a given extension

# this can be a dangerous command, therefore do it in two steps;
# first check that it is correct by just displaying the command
for f in *.fastq.gz; do n=${f%.fastq.gz}; co="mv $f ${n}.fq.gz"; echo $co ; done
# then replace echo with eval to execute it
for f in *.fastq.gz; do n=${f%.fastq.gz}; co="mv $f ${n}.fq.gz"; eval $co ; done
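The same check-then-run idea can be wrapped in a small function. This is only a minimal sketch, assuming a POSIX-like shell; the name rename_ext and the -n switch are made up for illustration and are not part of any tool. It calls mv directly instead of eval, so file names with spaces survive.

```shell
# rename_ext [-n] OLD NEW: rename every *.OLD file in the current directory
# to *.NEW. Hypothetical helper; with -n it only prints the mv commands.
rename_ext() {
    dry=0
    if [ "$1" = "-n" ]; then dry=1; shift; fi
    old=$1
    new=$2
    for f in *."$old"; do
        [ -e "$f" ] || continue          # nothing matched: skip the literal pattern
        n=${f%."$old"}                   # strip the old extension
        if [ "$dry" = 1 ]; then
            echo "mv $f $n.$new"         # step 1: just display the command
        else
            mv -- "$f" "$n.$new"         # step 2: actually rename
        fi
    done
}

# Usage:
#   rename_ext -n fastq.gz fq.gz   # check first
#   rename_ext fastq.gz fq.gz      # then execute
```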

Copy all files with a given extension from deep subdirectories into one directory

# the ** glob needs zsh, or bash after 'shopt -s globstar'
for i in **/*sort.bam; do n=$(basename "$i"); cp "$i" "../../harvest/hot/$n"; done

Copy the content of multiple files into a new one and add a file identifier

that’s pretty awesome 🙂

for i in **/*.mapedreads.txt; do d=`basename $i`; d=${d%.mapedreads.txt} ; cat $i|perl -spe 's/^/$d\t/' -- -d=$d; done
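The tagging step can also be done with awk alone. A minimal sketch, assuming plain-text input files; the function name tag_with_id is made up for illustration. It prefixes every line with the file's base name (minus a given suffix) and a tab.

```shell
# tag_with_id SUFFIX FILE...: print each line of every FILE, prefixed with the
# file's base name (SUFFIX removed) and a tab. Hypothetical helper name.
tag_with_id() {
    suffix=$1
    shift
    for f in "$@"; do
        id=$(basename "$f" "$suffix")            # e.g. s1.mapedreads.txt -> s1
        awk -v id="$id" '{ print id "\t" $0 }' "$f"
    done
}

# Usage (redirect to an output file of your choice):
#   tag_with_id .mapedreads.txt */*.mapedreads.txt > merged.txt
```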

Kill all Java processes

ps -e |grep 'java'|awk '{print $1}'|xargs kill -9

Zip all files with a given extension in all subdirectories

# the ** glob works in zsh (or in bash after 'shopt -s globstar')
ls -d **/*|grep '\.cmh$'|xargs gzip &

sam/bam

Reheader a bam file

The following workaround is necessary because samtools reheader produces a defective BAM file

samtools view toreheader.bam|cat newheader.sam -|samtools view -Sb - > reheadered.bam

fasta

Extract a subset of sequences 

samtools faidx dmel-short-masked.fasta 2L

Print the length of the sequences in the fasta file

(actually samtools faidx would be more efficient)

awk '/^>/ {if (seqlen) print seqlen;print;seqlen=0;next} {seqlen+=length($0)}END{print seqlen}' my.fasta | paste - - | sed 's/>//'
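For the samtools faidx route: it writes an index file next to the FASTA (my.fasta.fai) whose first two tab-separated columns are the sequence name and its length. A minimal sketch, assuming samtools is installed; the function name fasta_lengths is made up for illustration.

```shell
# fasta_lengths FASTA: print "name<TAB>length" for every sequence in FASTA.
# Hypothetical helper; assumes samtools is on the PATH and that the directory
# of FASTA is writable (the index is written there as FASTA.fai).
fasta_lengths() {
    samtools faidx "$1" && cut -f1,2 "$1.fai"
}

# Usage:
#   fasta_lengths my.fasta
```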

fastq

Split libraries from zipped fastq

gzip -cd testinput.fq.gz | paste - - - - | awk 'BEGIN{FS="\t"}$1~/#CGATGTAT\//' | tr "\t" "\n" | gzip -c > testoutput.fq.gz

Obtain reads within the given size range

e.g. reads with a length between 23 nt and 29 nt (after paste, the sequence is the second tab-separated field)

cat r5-16.nucs15-35.fastq|paste - - - - |perl -F'\t' -ane 'print if length($F[1])>=23 && length($F[1])<=29'|tr "\t" "\n"|head
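The same filter with adjustable bounds, as a minimal sketch: it assumes 4-line FASTQ records (no wrapped sequences), and the function name fastq_len_range is made up for illustration. paste puts one record per line, awk tests the length of the second field (the sequence), and tr restores the 4-line layout.

```shell
# fastq_len_range MIN MAX: read FASTQ on stdin and write to stdout only the
# records whose sequence length lies between MIN and MAX (inclusive).
# Hypothetical helper; assumes 4 lines per record.
fastq_len_range() {
    paste - - - - |
        awk -F '\t' -v min="$1" -v max="$2" \
            'length($2) >= min && length($2) <= max' |
        tr '\t' '\n'
}

# Usage:
#   fastq_len_range 23 29 < r5-16.nucs15-35.fastq | head
```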

Subsample from fastq files

The trick is that using the same seed (-s100) for both files keeps the paired reads in sync. This uses seqtk: https://github.com/lh3/seqtk

seqtk sample -s100 reads_1.fastq 1045174 > reads-ss_1.fastq
seqtk sample -s100 reads_2.fastq 1045174 > reads-ss_2.fastq
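To double-check that the two subsampled files really are in sync, the read names can be compared record by record. A minimal sketch that does not call seqtk itself: it assumes 4-line FASTQ records and read names that differ at most by a trailing /1 or /2 or by a comment after the first space. The function name fastq_in_sync is made up for illustration.

```shell
# fastq_in_sync R1 R2: exit 0 if both FASTQ files list the same read names in
# the same order, exit 1 otherwise. Hypothetical helper; assumes 4 lines per
# record and names that differ only by a /1 or /2 suffix or a trailing comment.
fastq_in_sync() {
    paste "$1" "$2" | awk -F '\t' '
        # reduce a header line to the bare read name
        function name(s) { sub(/ .*$/, "", s); sub(/\/[12]$/, "", s); return s }
        NR % 4 == 1 && name($1) != name($2) { bad = 1; exit }
        END { exit bad + 0 }'
}

# Usage:
#   fastq_in_sync reads-ss_1.fastq reads-ss_2.fastq && echo "in sync"
```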

Sort fastq file by read length, having the longest first (thanks Flo)

cat my.fastq|paste - - - - |perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;'|sort -n -r| cut -f2- |tr "\t" "\n" > mysorted.fastq