GenABEL: an annoying error after the import of PLINK data format

In the previous post we saw how much convenient could be GenABEL in the management of genotypic/phenotypic data.
We introduced the import of genotypic data from an Illumina format file:
> convert.snp.illumina(inf = "gen.illu", out = "gen.raw", strand = "file")
but what happens if you're analysing your data with PLINK, the open source toolset for GWAS?


GenABEL provides an useful function:

> convert.snp.tped(tped = "dataset.tped", tfam="dataset.tfam", out = "gen.raw", strand = "+")

This function allows you to import the transposed-ped format file and the tfam format file and it permits to convert them into the .raw file which contains the genotypic raw data readable by load.gwaa.data function.
The conversion is performed by C++ code that is both fast and memory efficient.

It's all great. But..

when you use the function:

> df <- load.gwaa.data(phe = phe.txt", gen = "gen.raw")

you could receive this warning:

person with id = a123456 was not found in genotypic file; excluded

Good! In your phenotypic dataset there are more person than in the genotypic dataset, GenABEL understands that 'mistake' and it doesn't load that person.

What happens you are adding a person who is present in the genotypic file and not in the phenotypic file?

person with id = 654321a was not found in phenotypic file!!! - FATAL

Bad. You'll get an error!

So you need to fix your dataset: you have to delete persons, which aren't present in phenotypic data, from your genotypic dataset. This kind of problem could happen when you receive the datasets from another person (it's my case).
You could edit the .tfam o .tped files. You should check the persons present in your phenotypic file, then you have to take off the rows in the .tfam file corresponding to those persons and the corresponding columns in the .tped file.

There could be another solution:

> convert.snp.tped(tped = "genabelGWASALL.tped", tfam="genabelGWASALL.tfam", out = "genot.raw", strand = "+")
> file_tfam <- read.table(file="genabelGWASALL.tfam", sep=' ', head=FALSE) # genotypes
> file_phe <- read.table(file="phe.txt", sep=' ', head=TRUE) # phenotypes

> ot_id ← setdiff(file_tfam[,2], file_phe$id) # ids present only in tfam (and tped) file
> lot_id ← length(ot_id)

> fakepheno <- data.frame(id = ot_id, sex=rep('1', lot_id), phen = rep('NA', lot_id))
> names(fakepheno) <- names(file_phe)
> fennew <- rbind(file_phe, fakepheno)

Now you can create the new phenotypic file:

> write.table(fennew, file="newpheno.txt", row.names=FALSE, col.names=TRUE, sep=' ')

Finally you can use load.gwaa.data function and move on with your analysis!

> newdataset <- load.gwaa.data(phe = "newpheno.txt", gen = "genot.raw",force = T)

Remenber to delete the fake phenotypes:

> newdataset1 <- newdataset[newdataset@phdata$id %in% file_phe$id,]

1
Shares

About Andrea Pedretti

M.Sc. in Bioinformatics at Università Milano-Bicocca. Bioinformatician at Parco Tecnologico Padano: research focused on genetic association analyses in plants and animals.
This entry was posted in R and tagged , . Bookmark the permalink.

Leave a Reply