- In:
- Posted By: abrahamdsl
- Comments: 0
Back in April, my boss generated lots of SNP/indels and he intends to have it displayed on the web portal I'm developing for the use of the C4 Rice Project Consortium Members. I was stuck for a long time looking for a pre-existing software/script to do the job. It is because per experience, the task seems so simple so most probably there could be a lot of existing ones somewhere out there. Search here, search there, read threads here, there - found out some ( ex: iPlant Collaborative - but it's not public | Another one on Github ), but they just end up not working. So, finally did my own. And since there a lot of threads that feature similar predicament, I'm publishing it to the web. I spoke to my supervisor about this and he assured me that it would be okay, it's not confidential/subject to the IRRI Intellectual Property matters, add to that, we doing Bioinformatics here always use open-source if not free programs/scripts, and these are also contributed by the community.
So, here it is. You can also visit Github : https://github.com/abrahamdsl/vcf_to_gff3/blob/master/vcf_to_gff3.pl .
Sorry for the cosmetic inconvenience. I'm still lost in the world of WYSIWYG editors in Drupal. :D Please go to the Github link above instead if you find this stressing.
#!/usr/bin/perl =doc a [dot] llave [at] irri.org 28APR2014-1445 A basic VCF to GFF3 conversion tool. Includes AF and DP (allele frequency and read depth respectively) from INFO column of VCF. A spin-off from /home/applications/vaast_converter.pl which was referred from http://gmod.827538.n3.nabble.com/vcf-to-gff3-td3434214.html Copyright (C) 2014 Abraham Darius S. Llave under the C4 Rice Project, International Rice Research Institute This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. =cut use strict; use warnings; use Getopt::Long; #----------------------------------------------------------------------------- #----------------------------------- MAIN ------------------------------------ #----------------------------------------------------------------------------- # specify your source my $sourceName = ""; # used for identifying the feature's source as can be gleamed from the feature ID/name my $featureIDStart = ""; # this would be part of the feature ID/name to identify the dataset. my $datasetTag = ""; my $countSNP = 0; my $countINDEL = 0; my $debugMode = 0; my $noUnderscore = 0; my $separator = "_"; my $usage = " Synopsis: vcf_to_gff3.pl [options] file1.vcf [file2.vcf...] Options: --help Optional. Display this help. --path Optional. Default directory to write results. Defaults to current/`pwd`. --orgsource Optional. A string describing the organization/institute that is source of the features, for column 2 of GFF file. Defaults to a dot. --orgsource2 Optional. A string describing the organization/institute that is source of the features, put in the beginning of the name. Default to 'loc'. --string_tag Required. A string of minlength 1 to describe the features. --nounderscore Optional. Do not separate the --orgsource from the rest of the characters in the features' name/ID. Default false. Description: This script will convert VCF files to GFF3. Take note, that outputted info are limited to location, variance, allele frequency and read depth. By default structure of feature ID/name in quasi-regex: (orgsource|'loc')(undesrcore)?<(I|S)><string_tag><order of feature among S/I> "; my ($help, $build, $append, $path); my $opt_success = GetOptions('help' => \$help, 'nounderscore' => \$noUnderscore, 'orgsource=s' => \$sourceName, 'orgsource2=s' => \$featureIDStart, 'path=s' => \$path, 'string_tag=s' => \$datasetTag ); die $usage if $help || ! $opt_success; $path ||= './'; $featureIDStart ||= "loc"; $sourceName ||= "."; if ($noUnderscore) { $separator = "" }; handle_message('FATAL', 'directory_does_not_exist', $path) unless -e $path; handle_message('FATAL', 'path_is_not_a_directory', $path) unless -d $path; handle_message('FATAL', 'directory_is_not_writable', $path) unless -w $path; handle_message('FATAL', 'string_tag should be at least 1 char', $datasetTag) unless ( length($datasetTag) > 0 ); handle_message('NOTICE', 'string_tag ', $datasetTag); my @vcf_files = @ARGV; if (! @vcf_files) { print $usage; handle_message('FATAL', 'no_vcf_files_to_convert'); } for my $vcf_file (@vcf_files) { if ( $vcf_file !~ /\./ ){ handle_message('FATAL', 'VCF file should have an extension name (i.e. dot)'); } ( my $vcf_file_withoutExt = $vcf_file ) =~ s/\.\w+$//; my $gff_file = $vcf_file_withoutExt . ".gff"; handle_message('FATAL', 'file_does_not_exist', $vcf_file) unless -e $vcf_file; open(my $IN, '<', $vcf_file) or handle_message('FATAL', 'cant_open_file_for_reading', $vcf_file); open(my $OUThandle, ">", $gff_file) or handle_message('FATAL', 'cant_open_file_for_writing', $gff_file); print $OUThandle "##gff-version 3\n"; print $OUThandle "# Generated " . localtime() . ' via https://github.com/abrahamdsl/vcf_to_gff3' . "\n"; =debug # a [dot] llave [at] irri.org THIS MIGHT NOT BE NEEDED IN OUR CASE my $header; while (my $line = <$IN>) { if ($line =~ /^#CHROM/) { $header = $line; last; } } handle_message('FATAL', 'no_header_line_found', 'File was read, but no header line was found') unless $header; my @ids = split /\s/, $header; splice(@ids, 0, 9); =cut handle_message( 'NOTICE', 'Processing file', $vcf_file ); handle_message( 'NOTICE', 'Writing to file', $gff_file ); while (my $line = <$IN>) { if( $line =~ /^#/ or $line =~ /^super/ ){ next; } my @data = split /\t/, $line; #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 my %record; @record{qw(chrom pos id ref alt qual filter info format)} = splice(@data, 0, 9); my @feature_ids = split /;/, $record{id}; # more often than not this is just '.' my $feature_id = shift @feature_ids; my $seqid = $record{chrom}; my $source; my $type; my $start = $record{pos}; my $end = $start; my $score = $record{qual}; # In VCF, variants are always in forward representation. # Citation: http://www.1000genomes.org/faq/what-strand-are-variants-your-vcf-file my $strand = '+'; my $phase = '.'; # not needed to specify. my $note; my $featureID; my $reference_seq = uc $record{ref}; my @infos = split( /;/, $record{info} ); # Which details in the INFO field do you want to include? my @wanted = qw(AF DP); my $wantedCount = scalar( @wanted ); my $count = 0; my %includeMisc; my @variant_seqs = split /,/, $record{alt}; map {$_ = uc $_} @variant_seqs; $featureID = $featureIDStart; foreach( @infos ){ my @temp = split( /=/, $_ ); if( grep { $_ eq $temp[0] } @wanted ){ # are there multiple variances? my @diffCases = split( /,/, $temp[1] ); foreach( @diffCases ){ # see if it has decimal and worthy to be included (i.e !A.XY WHERE XY are not both zero and A is natural num) my @digits = split ( /\./, $_ ); my $value; if( scalar( @digits ) > 1 ){ if( $digits[1] > 0 ){ $value = $_; # decimal is worthy, include $value =~ s/0{1,}$//; # remove trailing zeroes }else{ $value = $digits[0]; # no decimal, just the whole num / left part of the num in decimal } }else{ $value = $_; } push( @{ $includeMisc{ $temp[0] } }, $value ); $count++; } } if( $count >= $wantedCount ){ last; } } =info I opted to add "_I" or "_S" to indicate INDEL or SNP, you might want to modify these parts accordingly. =cut if( length($reference_seq) > 1 or grep { length $_ > 1 } @variant_seqs ){ $type = "indel"; # It seems to me, Chado and GBrowse is not okay if this is in CAPS! $end = ($start + length($reference_seq)) - 1; $countINDEL += 1; $featureID .= ( $separator . "I" . $datasetTag . $countINDEL ); }else{ $type = "SNP"; $countSNP += 1; $featureID .= ( $separator . "S" . $datasetTag . $countSNP ); } foreach( @variant_seqs ){ $note = "$reference_seq>$_,"; foreach my $key( sort keys %includeMisc ){ $note .= " $key=" . shift(@{ $includeMisc{$key} }); } print $OUThandle "$seqid\t$sourceName\t$type\t$start\t$end\t$score\t$strand\t$phase\tID=$featureID;Name=$featureID;Note=$note\n"; } } handle_message( 'NOTICE', 'Finished', 'Written to $gff_file .' ); } sub handle_message { my $message; my ($level, $code, $sentMsg) = @_; $sentMsg ||= "No message."; chomp $sentMsg; $message .= ( localtime() . ' ' . $sentMsg . "\n" ); if ($level eq 'FATAL') { die join ' : ', ($level, $code, $message); }else { print STDERR join ' : ', ($level, $code, $message); } }
- Log in to post comments