General
Manipulating all files with a given extension
# power chassis (generic template)
for f in directory/*.ext ; do
    n=`basename "$f"`
    fn=${n%.ext}
    mycodehere > "outdir/${fn}.newext"
done

# for example, quality-filter all bam files in a directory
for f in bam-uf/*.bam ; do
    n=`basename "$f"`
    fn=${n%.bam}
    samtools view -b -q 20 -f 0x002 -F 0x004 -F 0x008 "$f" > "bam-mq20/${fn}.q20.bam"
done
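The loop above can also be wrapped into a small function. This is only a sketch: the name `convert_all` and its arguments are made up for illustration, and `tr 'a-z' 'A-Z'` is a placeholder standing in for `mycodehere`. It quotes all variables so that file names with spaces survive, and it skips the loop body when the glob matches nothing.

```shell
# Hypothetical helper: convert_all INDIR OUTDIR EXT NEWEXT
# Runs a command on every INDIR/*.EXT and writes OUTDIR/<name>.NEWEXT.
# The function body runs in a subshell, so its variables do not leak.
convert_all() (
    indir=$1; outdir=$2; ext=$3; newext=$4
    mkdir -p "$outdir"
    for f in "$indir"/*."$ext"; do
        [ -e "$f" ] || continue        # glob matched nothing
        n=$(basename "$f")             # strip the directory
        fn=${n%."$ext"}                # strip the old extension
        # placeholder command; replace with your own
        tr 'a-z' 'A-Z' < "$f" > "$outdir/$fn.$newext"
    done
)
```

With the placeholder swapped for a real command, a call would look like `convert_all bam-uf bam-mq20 bam q20.bam`.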
Remove empty lines
cat tocleanup.txt | perl -pe 's/^\s+$//' | sed '/^$/d' > cleaned.txt
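The same cleanup can probably be done in one step with awk: `NF` is zero for lines that are empty or contain only whitespace, so printing only lines with `NF > 0` drops them. The function name `remove_blank_lines` is just an illustrative choice.

```shell
# Hypothetical helper: print only lines that contain at least one field,
# i.e. drop empty and whitespace-only lines.
# Reads the given files, or stdin when no file is given.
remove_blank_lines() {
    awk 'NF' "$@"
}
# usage: remove_blank_lines tocleanup.txt > cleaned.txt
```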
Get a sequence from 0 to 100 in steps of 10
for i in `seq 0 10 100`; do echo $i; done
Rename all files with a given extension
# this can be a dangerous command, therefore two steps
# first check that it is correct: just display the commands
for f in *.fastq.gz; do n=${f%.fastq.gz}; co="mv $f ${n}.fq.gz"; echo $co ; done
# then replace echo with eval to execute them
for f in *.fastq.gz; do n=${f%.fastq.gz}; co="mv $f ${n}.fq.gz"; eval $co ; done
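The check-then-execute idea can also be sketched as one function with a dry-run switch, which avoids `eval` and keeps file names with spaces intact. `rename_ext` and its `-n` flag are made-up names for this sketch, not an existing tool.

```shell
# Hypothetical helper: rename_ext [-n] OLD_EXT NEW_EXT
# Works on the current directory. With -n the mv commands are only
# printed, not executed.
rename_ext() (
    dry=0
    if [ "$1" = "-n" ]; then dry=1; shift; fi
    old=$1; new=$2
    for f in *."$old"; do
        [ -e "$f" ] || continue        # glob matched nothing
        n=${f%."$old"}                 # strip the old extension
        if [ "$dry" -eq 1 ]; then
            echo "mv $f $n.$new"
        else
            mv -- "$f" "$n.$new"
        fi
    done
)
# first check:  rename_ext -n fastq.gz fq.gz
# then execute: rename_ext fastq.gz fq.gz
```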
Copy all files with a given extension from deep subdirectories into one directory
for i in **/*sort.bam; do n=`basename $i`; cp $i ../../harvest/hot/$n; done
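The recursive `**` glob needs zsh, or bash with `shopt -s globstar`. Where neither is available, `find` should do the same job. The sketch below uses a hypothetical `harvest` function and assumes the base names are unique, since files with the same name would overwrite each other in the target directory.

```shell
# Hypothetical helper: harvest PATTERN SRC_DIR DEST_DIR
# Copies every regular file below SRC_DIR whose name matches PATTERN
# into DEST_DIR (which should not lie inside SRC_DIR).
harvest() (
    pattern=$1; src=$2; dest=$3
    mkdir -p "$dest"
    find "$src" -type f -name "$pattern" -exec cp {} "$dest"/ \;
)
# usage: harvest '*sort.bam' . ../../harvest/hot
```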
Copy the content of multiple files into a new one and add a file identifier
that’s pretty awesome 🙂
for i in **/*.mapedreads.txt; do d=`basename $i`; d=${d%.mapedreads.txt} ; cat $i|perl -spe 's/^/$d\t/' -- -d=$d; done
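An awk-only sketch of the same idea: awk knows the name of the file it is currently reading (`FILENAME`), so the identifier can be derived once per file and prepended to every line. The function name `tag_lines` and its suffix argument are assumptions made for this example.

```shell
# Hypothetical helper: tag_lines SUFFIX FILE...
# Prefixes every line with the file's base name (minus SUFFIX) and a tab.
tag_lines() (
    suffix=$1; shift
    awk -v suffix="$suffix" '
        FNR == 1 {                                  # first line of each file
            id = FILENAME
            sub(/.*\//, "", id)                     # drop the directory part
            n = length(id) - length(suffix)
            if (n >= 0 && substr(id, n + 1) == suffix)
                id = substr(id, 1, n)               # drop the suffix
        }
        { print id "\t" $0 }
    ' "$@"
)
# usage (zsh, or bash with globstar):
#   tag_lines .mapedreads.txt **/*.mapedreads.txt > combined.txt
```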
Kill all Java processes
ps -e |grep 'java'|awk '{print $1}'|xargs kill -9
Zip all files with a given extension in all subdirectories
# this relies on the recursive ** glob of zsh (bash needs shopt -s globstar)
ls -d **/*|grep '\.cmh$'|xargs gzip &
sam/bam
Reheader a bam file
The following workaround is needed because samtools reheader produces a defective bam file
samtools view toreheader.bam|cat newheader.sam -|samtools view -Sb - > reheadered.bam
fasta
Extract a subset of sequences
samtools faidx dmel-short-masked.fasta 2L
Print the length of the sequences in the fasta file
(actually samtools faidx would be more efficient)
awk '/^>/ {if (seqlen) print seqlen;print;seqlen=0;next} {seqlen+=length($0)}END{print seqlen}' my.fasta | paste - - | sed 's/>//'
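A variant that writes name and length directly as two tab-separated columns, without the `paste` and `sed` steps. Unlike the one-liner above, it also reports sequences of length zero, which would otherwise shift the `paste` pairing. The function name `fasta_lengths` is an illustrative assumption.

```shell
# Hypothetical helper: print "<name><TAB><length>" for every fasta record.
# Reads the given files, or stdin when no file is given.
fasta_lengths() {
    awk '
        /^>/ { if (name != "") print name "\t" len   # finish previous record
               name = substr($0, 2); len = 0; next }
             { len += length($0) }                   # sequence line
        END  { if (name != "") print name "\t" len }
    ' "$@"
}
# usage: fasta_lengths my.fasta
```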
fastq
Split libraries from zipped fastq
gzip -cd testinput.fq.gz | paste - - - - | awk 'BEGIN{FS="\t"}$1~/#CGATGTAT\//' | tr "\t" "\n" | gzip -c > testoutput.fq.gz
Obtain reads within the given size range
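e.g. reads with a length between 23 nt and 29 nt; splitting on tabs (-F'\t') keeps the sequence in the second field even if the read name contains spaces
cat r5-16.nucs15-35.fastq|paste - - - - |perl -F'\t' -ane 'print if length($F[1])>=23 && length($F[1])<=29'|tr "\t" "\n"|head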
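A perl-free sketch of the same filter with awk. It splits explicitly on tabs, so the sequence is always field 2 of the pasted record, whether or not the read name contains spaces. The function name `fastq_length_filter` and its two arguments are assumptions for this example.

```shell
# Hypothetical helper: fastq_length_filter MIN MAX < in.fastq > out.fastq
# Keeps reads whose sequence length is between MIN and MAX (inclusive).
# Assumes four-line fastq records.
fastq_length_filter() (
    min=$1; max=$2
    paste - - - - |
        awk -F '\t' -v min="$min" -v max="$max" \
            'length($2) >= min && length($2) <= max' |
        tr '\t' '\n'
)
# usage: fastq_length_filter 23 29 < r5-16.nucs15-35.fastq | head
```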
Subsample from fastq files
The trick is that using the same seed (-s100) ensures that the two files stay in sync. Seqtk: https://github.com/lh3/seqtk
seqtk sample -s100 reads_1.fastq 1045174 > reads-ss_1.fastq
seqtk sample -s100 reads_2.fastq 1045174 > reads-ss_2.fastq
Sort fastq file by read length, having the longest first (thanks Flo)
cat my.fastq|paste - - - - |perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;'|sort -n -r| cut -f2- |tr "\t" "\n" > mysorted.fastq
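The same decorate, sort, undecorate pattern can be sketched without perl: awk prepends the sequence length as a sort key, `sort` orders on that key only, and `cut` removes it again. The function name `fastq_sort_by_length` is made up for this example, and the order among reads of equal length is left to `sort`.

```shell
# Hypothetical helper: fastq_sort_by_length < in.fastq > out.fastq
# Sorts four-line fastq records by read length, longest first.
fastq_sort_by_length() {
    paste - - - - |
        awk -F '\t' '{ print length($2) "\t" $0 }' |   # add length as key
        sort -k1,1nr |                                 # numeric, descending
        cut -f2- |                                     # remove the key
        tr '\t' '\n'
}
# usage: fastq_sort_by_length < my.fastq > mysorted.fastq
```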