Monday, September 29, 2014

Afontova Gora-2 DNA

As with the ancient Mal’ta boy's DNA, the same paper also provided raw data in BAM format for another ancient sample, Afontova Gora-2, from a site on the western bank of the Enisei River in south-central Siberia. I converted the raw data supplied with the paper to formats familiar to genetic genealogists. I also filtered it against the SNPs tested by DNA testing companies such as FTDNA, 23andMe and Ancestry in order to upload it to GEDMatch, but found that this ancient DNA has few SNPs (~47,000) in common with them. Hence, I did not upload it to GEDMatch. I also reprocessed the data from the sequence read run files, which did not make much difference, but I am making both versions available below.
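As a rough illustration of the overlap check described above, the sketch below counts the rsIDs that two 23andMe-style files have in common. It is not the exact procedure used for this post; the function name common_snps and the two toy files are hypothetical, and both inputs are assumed to be tab-separated with '#' comment lines.

```shell
# Count rsIDs present in both files (23andMe-style: tab-separated, '#' comments).
# common_snps and the file names below are illustrative assumptions.
common_snps() {
  awk -F'\t' '
    /^#/ { next }                      # skip comment/header lines
    NR == FNR { seen[$1] = 1; next }   # first file: remember each rsID
    ($1 in seen) { n++ }               # second file: count rsIDs already seen
    END { print n + 0 }
  ' "$1" "$2"
}

# Toy data: the two files share rs2 and rs3.
tmp=$(mktemp -d)
printf '# rsid\tchromosome\tposition\tgenotype\nrs1\t1\t100\tAA\nrs2\t1\t200\tAG\nrs3\t2\t300\tCC\n' > "$tmp/ancient.txt"
printf '# rsid\tchromosome\tposition\tgenotype\nrs2\t1\t200\tGG\nrs3\t2\t300\tCT\nrs4\t2\t400\tTT\n' > "$tmp/company.txt"
common_snps "$tmp/ancient.txt" "$tmp/company.txt"   # prints 2
```

A low count relative to the size of the company file suggests that a matching service would have little to compare.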

Download: 
Reference:
Upper Palaeolithic Siberian genome reveals dual ancestry of Native Americans.
Raghavan, Maanasa, Pontus Skoglund, Kelly E. Graf, Mait Metspalu, Anders Albrechtsen, Ida Moltke, Simon Rasmussen et al. "Upper Palaeolithic Siberian genome reveals dual ancestry of Native Americans." Nature (2013).

Data Used

Sunday, September 28, 2014

La Braña-Arintero DNA

An approximately 7,000-year-old Mesolithic skeleton discovered at the La Braña-Arintero site in León, Spain, was sequenced to retrieve a complete pre-agricultural European human genome, and the authors made the sequence reads available to the public. I converted the raw sequence reads supplied with the scientific paper to formats familiar to genetic genealogists.

Download: 
Reference:
Derived immune and ancestral pigmentation alleles in a 7,000-year-old Mesolithic European.
Olalde, Iñigo, Morten E. Allentoft, Federico Sánchez-Quinto, Gabriel Santpere, Charleston WK Chiang, Michael DeGiorgio, Javier Prado-Martinez et al. "Derived immune and ancestral pigmentation alleles in a 7,000-year-old Mesolithic European." Nature 507, no. 7491 (2014): 225-228.

Data Used

Friday, September 26, 2014

Tianyuan DNA

Ancient DNA of a human from Tianyuan Cave, outside Beijing, China. I converted the raw data supplied with the scientific paper to formats familiar to genetic genealogists. Please note that this download contains data for chromosome 21 only.

Download: 
Reference:
DNA analysis of an early modern human from Tianyuan Cave, China.
Fu, Qiaomei, Matthias Meyer, Xing Gao, Udo Stenzel, Hernán A. Burbano, Janet Kelso, and Svante Pääbo. "DNA analysis of an early modern human from Tianyuan Cave, China." Proceedings of the National Academy of Sciences 110, no. 6 (2013): 2223-2227.

Data Used

Mal’ta MA-1 DNA

The origins of the First Americans remain contentious. Although Native Americans seem to be genetically most closely related to east Asians, there is no consensus with regard to which specific Old World populations they are closest to. In this paper, the authors sequenced the genome of an ancient individual (MA-1) from Mal’ta in south-central Siberia to an average depth of 1x. To the authors' knowledge, this MA-1 DNA is the oldest anatomically modern human genome reported to date. I converted the raw sequence reads supplied with the scientific paper to formats familiar to genetic genealogists.

Download: 
Reference:
Upper Palaeolithic Siberian genome reveals dual ancestry of Native Americans.
Raghavan, Maanasa, Pontus Skoglund, Kelly E. Graf, Mait Metspalu, Anders Albrechtsen, Ida Moltke, Simon Rasmussen et al. "Upper Palaeolithic Siberian genome reveals dual ancestry of Native Americans." Nature (2013).

Data Used
Related Blogs

Thursday, September 25, 2014

Autosomal Compare (*nix)

For comparing two autosomal files, there is a Windows tool called Autosomal Segment Analyzer. On Unix-based systems, however, you can compare them using the command below. The command assumes that both autosomal files are in 23andMe format. If your autosomal files are in any other format, you should be able to convert them using the commands provided on the Autosomal Converter for *nix page.

Prerequisites: Any Unix-based system

$ join --nocheck-order -e EMPTY --header file1.txt file2.txt |awk 'BEGIN { snp_threshold = 700; mb_threshold = 7; error_radius = 350; largest_mb = 0; total_mb = 0; count=0; seg_start=0;seg_end=0; chr=0; pchr=0; error_pos=0; print "\nChr\tStart Position\tEnd Position\tLen(Mb)\tSNPs";} { chr = $2; seg_len = (seg_end-seg_start)/1000000; if( !($4 == $7 || substr($4,1,1) == substr($7,1,1)|| substr($4,2,1) == substr($7,2,1) || substr($4,1,1) == substr($7,2,1)|| substr($4,2,1) == substr($7,1,1) ) || pchr!=chr) { if( seg_end - error_pos > error_radius ) { count++; seg_end = $3; } else { if( count > snp_threshold && seg_len > mb_threshold) { total_mb = total_mb + seg_len; if(largest_mb < seg_len) largest_mb = seg_len; print chr"\t"seg_start"\t"seg_end"\t"seg_len"\t"count; } count = 0; seg_start = $3; } error_pos=$3; } else { count++; seg_end = $3; } pchr = chr;}END {print "\nLargest Segment: "largest_mb" Mb";print "Total Shared: "total_mb" Mb\n";}'

Note: file1.txt and file2.txt are the two files being compared. The SNP threshold of 700 (snp_threshold), the Mb threshold of 7 (mb_threshold) and the error radius of 350 (error_radius) are set in the BEGIN block of the awk script and can be modified.
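The core of the comparison is the test that decides whether two genotypes at the same SNP agree. The sketch below isolates that test so it can be tried on individual genotype pairs; the helper name half_match is hypothetical, and the condition mirrors the one in the awk script above (two genotypes match when they are identical or share an allele in any position).

```shell
# half_match A B: prints "match" when the two genotypes are identical or
# share at least one allele, otherwise "mismatch".
# half_match is an illustrative helper, not part of the command above.
half_match() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    a1 = substr(a, 1, 1); a2 = substr(a, 2, 1)
    b1 = substr(b, 1, 1); b2 = substr(b, 2, 1)
    if (a == b || a1 == b1 || a2 == b2 || a1 == b2 || a2 == b1)
      print "match"
    else
      print "mismatch"
  }'
}

half_match AG GG   # match: both carry G
half_match AA GG   # mismatch: no allele in common
half_match CT TC   # match: same alleles in a different order
```

In the full command, each mismatch either gets absorbed into the current segment or ends it, depending on how far it is from the previous mismatch.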


FASTA to RSRS (*nix)

You may already know a Windows tool for getting RSRS markers from FASTA, called FASTA to RSRS (With Visualizer). On Unix-like platforms, however, you can extract them right from the console. The following commands will help you get RSRS markers from an mtDNA FASTA file.

Prerequisites:

  • Any Unix-based system
  • An internet connection, or RSRS.fasta already downloaded and placed in the current directory. If it was downloaded manually, the wget command can be skipped.


Commands:
$ wget http://www.phylotree.org/resources/RSRS.fasta
$ cat RSRS.fasta |tail -n +2|sed ':a;N;$!ba;s/\n//g' |sed -E "s/([ATGCN])/\1\n/g" > RSRS.seq
$ cat input.fasta |tail -n +2|sed ':a;N;$!ba;s/\n//g' |sed -E "s/([ATGCN])/\1\n/g" > input.seq
$ diff -B --old-line-format=' %l%3dn' --new-line-format='%L'  --suppress-common-lines RSRS.seq input.seq|grep -P '([0-9])+'|sed s/\\s/\\n/g |  sed '/^\s*$/d'|grep -P -v 'N[0-9]+'

Note: input.fasta is your FASTA input file.
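To see what the intermediate .seq files contain, here is a small sketch using toy data. It drops the FASTA header line and writes one base per line, so that the line number equals the position in the sequence. The helper names fasta_to_seq and seq_diff are hypothetical, and seq_diff is a simplification: it reports substitutions only and assumes both sequences have the same length, whereas the diff-based command above can also cope with insertions and deletions.

```shell
# fasta_to_seq FILE: skip the header line, print one base per line.
fasta_to_seq() {
  awk 'NR > 1 {
    sub(/\r$/, "")                       # tolerate Windows line endings
    n = length($0)
    for (i = 1; i <= n; i++) print substr($0, i, 1)
  }' "$1"
}

# seq_diff REF.seq SAMPLE.seq: print substitutions as <ref><position><sample>,
# ignoring positions where the sample has N.
seq_diff() {
  paste "$1" "$2" | awk '$1 != $2 && $2 != "N" { print $1 NR $2 }'
}

# Toy data (not real mtDNA): the sample differs at positions 4 and 12.
tmp=$(mktemp -d)
printf '>ref\nGATCAC\nAGGTCT\n'    > "$tmp/ref.fasta"
printf '>sample\nGATTAC\nAGGTCC\n' > "$tmp/sample.fasta"
fasta_to_seq "$tmp/ref.fasta"    > "$tmp/ref.seq"
fasta_to_seq "$tmp/sample.fasta" > "$tmp/sample.seq"
seq_diff "$tmp/ref.seq" "$tmp/sample.seq"   # prints C4T and T12C
```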

Screenshot:
Output for how the RSRS markers will be displayed.


Autosomal DNA Converter (*nix)

In Unix-like environments, DNA files can easily be converted using simple commands. On Windows, you can always use the Autosomal DNA Converter (Windows) tool for conversion. Unlike Windows, Unix-based systems don't require any special tools for conversion, and the commands below work out of the box.

Prerequisites: Any Unix-based system

Converting 23andMe to FamilyTreeDNA format

$ echo "RSID,CHROMOSOME,POSITION,RESULT" > output.csv
$ cat input.txt|grep -v '#'|awk -F'\t' '{ print "\""$1"\",\""$2"\",\""$3"\",\""$4"\""; }' >> output.csv

Note: The input.txt is the 23andMe autosomal file and the output.csv will be in FTDNA format.

Converting Ancestry to FamilyTreeDNA format

$ echo "RSID,CHROMOSOME,POSITION,RESULT" > output.csv
$ cat input.txt|grep -v '#'|grep -v 'rsid'|awk -F'\t' '$2 != 24 && $2 != 25 { print "\""$1"\",\""$2"\",\""$3"\",\""$4$5"\""; }'|sed s/,\"23\",/,\"X\",/g >> output.csv

Note: The input.txt is the Ancestry autosomal file and the output.csv will be in FTDNA format.

Converting FamilyTreeDNA to 23andMe format

$ echo -e "# rsid\tchromosome\tposition\tgenotype" > output.txt
$ cat input.csv|tail -n +2|cut -d, -f1,2,3,4|sed s/\"//g|sed s/,/\\t/g >> output.txt

Note: The input.csv is the FTDNA autosomal file and the output.txt will be in 23andMe format.


Converting Ancestry to 23andMe format

$ echo -e "# rsid\tchromosome\tposition\tgenotype" > output.txt
$ cat input.txt|grep -v '#'|grep -v 'rsid'|awk -F'\t' '{ print $1"\t"$2"\t"$3"\t"$4$5; }'|sed s/\\t23\\t/\\tX\\t\/g |sed s/\\t24\\t/\\tY\\t\/g| grep -P -v '\t25\t' >> output.txt

Note: The input.txt is the Ancestry autosomal file and the output.txt will be in 23andMe format.


Converting 23andMe to Ancestry format

$ echo -e "rsid\tchromosome\tposition\tallele1\tallele2" > output.txt
$ cat input.txt|grep -v '#'|grep -v 'rsid'|awk -F'\t' '{ print $1"\t"$2"\t"$3"\t"substr($4,1,1)"\t"substr($4,2,1); }'|sed s/\\tX\\t/\\t23\\t/g | sed s/\\tY\\t/\\t24\\t/g >> output.txt

Note: The input.txt is the 23andMe autosomal file and the output.txt will be in Ancestry format.

Converting FamilyTreeDNA to Ancestry format

$ echo -e "rsid\tchromosome\tposition\tallele1\tallele2" > output.txt
$ cat input.csv|tail -n +2|cut -d, -f1,2,3,4|sed s/\"//g|sed s/,/\\t/g | awk -F'\t' '{ print $1"\t"$2"\t"$3"\t"substr($4,1,1)"\t"substr($4,2,1); }'|sed s/\\tX\\t/\\t23\\t/g | sed s/\\tY\\t/\\t24\\t/g >> output.txt

Note: The input.csv is the FTDNA autosomal file and the output.txt will be in Ancestry format.
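After any of these conversions, a quick sanity check is to compare the number of SNP rows before and after. The sketch below is one way to do that; the helper name snp_rows and the toy input are hypothetical. It counts every line that is not a '#' comment, not a header starting with rsid (quoted or not, in any case), and not blank.

```shell
# snp_rows FILE: count data rows, skipping '#' comments, rsid header lines
# and blank lines. snp_rows is an illustrative helper.
snp_rows() {
  grep -v '^#' "$1" | grep -v -i '^"\{0,1\}rsid' | grep -c -v '^[[:space:]]*$'
}

# Toy 23andMe file with three SNPs, converted with the 23andMe to FTDNA commands.
tmp=$(mktemp -d)
printf '# comment\n# rsid\tchromosome\tposition\tgenotype\nrs1\t1\t100\tAA\nrs2\t1\t200\tAG\nrs3\tX\t300\tCC\n' > "$tmp/input.txt"
echo "RSID,CHROMOSOME,POSITION,RESULT" > "$tmp/output.csv"
grep -v '#' "$tmp/input.txt" | awk -F'\t' '{ print "\""$1"\",\""$2"\",\""$3"\",\""$4"\""; }' >> "$tmp/output.csv"
echo "input: $(snp_rows "$tmp/input.txt")  output: $(snp_rows "$tmp/output.csv")"
```

The two counts should be equal for most of the conversions above. The Ancestry conversions that filter out chromosomes 24 or 25 are expected to produce fewer rows than the input.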

Saturday, September 20, 2014

GGK CLI Import Tool

This project is aimed at importing autosomal files into Genetic Genealogy Kit (GGK) from the console/command line. This makes it possible to mass import files into the database instead of dragging and dropping them one at a time. It supports FTDNA, 23andMe and Ancestry raw data files.

Prerequisites:
Usage:
Syntax:
ggkimport <autosomal-file> <ggk.db-path> <kit-no> <kit-name>


E.g,
ggkimport D:\264652-autosomal-o36-results.csv D:\GGKv2\gk.db 264652 Felix
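For a mass import, the call above can be wrapped in a loop. The following is only a sketch for a Unix-like shell on Windows (for example Cygwin or Git Bash). It rests on assumptions that do not come from the tool itself: every file name starts with the kit number followed by a hyphen, as in the example above, and the kit number is reused as the kit name. The function name mass_import is hypothetical. By default it only prints the commands; pass run as the third argument to execute them.

```shell
# mass_import DIR DB [run]: call ggkimport for every .csv file in DIR.
# Assumptions: file names look like <kit-no>-....csv, and the kit number
# doubles as the kit name. Without "run", commands are printed, not executed.
mass_import() {
  dir=$1
  db=$2
  if [ "$3" = "run" ]; then prefix=""; else prefix="echo"; fi
  for f in "$dir"/*.csv; do
    [ -e "$f" ] || continue          # no matching files: skip the literal pattern
    kit=$(basename "$f")
    kit=${kit%%-*}                   # keep the text before the first hyphen
    $prefix ggkimport "$f" "$db" "$kit" "$kit"
  done
}
```

Running it first without run and reviewing the printed commands is a cheap way to confirm that the kit numbers were extracted as intended.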

Download : ggkimport.zip (641 KB)

Source Code at GitHub.

Change Log
Version 1.0
  • Initial release.

Clovis-Anzick DNA

Clovis, with its distinctive biface, blade and osseous technologies, is the oldest widespread archaeological complex defined in North America. The genome sequence of a male infant (Anzick-1) recovered from the Anzick burial site in western Montana is available for download. I took this data, converted it into formats familiar to genetic genealogists and also uploaded it to GEDMatch.

Download: 
  • GEDMatch# F999919 (Processed from BAM)
  • Download from Google Drive (Complete).
Reference:
The genome of a Late Pleistocene human from a Clovis burial site in western Montana.
Rasmussen M, Anzick SL, Waters MR, Skoglund P, DeGiorgio M, Stafford TW Jr, Rasmussen S, Moltke I, Albrechtsen A, Doyle SM, Poznik GD, Gudmundsdottir V, Yadav R, Malaspinas AS, White SS 5th, Allentoft ME, Cornejo OE, Tambets K, Eriksson A, Heintzman PD, Karmin M, Korneliussen TS, Meltzer DJ, Pierre TL, Stenderup J, Saag L, Warmuth VM, Lopes MC, Malhi RS, Brunak S, Sicheritz-Ponten T, Barnes I, Collins M, Orlando L, Balloux F, Manica A, Gupta R, Metspalu M, Bustamante CD, Jakobsson M, Nielsen R, Willerslev E
Nature. 2014 Feb 13;506(7487):225-9.

Data Used
Related Blog

Mezmaiskaya Neanderthal DNA

The Neanderthal genome project is a collaboration of scientists coordinated by the Max Planck Institute for Evolutionary Anthropology in Germany and 454 Life Sciences in the United States to sequence the Neanderthal genome.

This project aims to convert the massive amount of Neanderthal genome data into a raw data download familiar to genetic genealogists. Basically, I am just extracting the SNPs from the Neanderthal genome and constructing an autosomal raw data file, along with mtDNA and Y-DNA. Because it has few SNPs in common with those tested by DNA testing companies, I am not uploading it to GEDMatch.

Download: 
License:
Use of the genome sequence data section in Department of Evolutionary Genetics reads,
All data is made freely available. However, we ask users to observe the Ft. Lauderdale principles, which entitles the data producers to make the first presentation and publish the first genome-wide analysis of the data. The data can be used freely for studies of individual genes or other individual features of the genome.
I am using it as a hobby, so I think I don't fall under that category.

References / Data Used