Can I Upload Gz Files From GEO Datasets to IPA Analysis
This page discusses how to load GEO SOFT format microarray data from the Gene Expression Omnibus database (GEO) (hosted by the NCBI) into R/BioConductor. SOFT stands for Simple Omnibus Format in Text. There are actually four types of GEO SOFT file available:
GEO Platform (GPL)
These files describe a particular type of microarray. They are annotation files.
GEO Sample (GSM)
Files that contain all the data from the use of a single chip. For each gene there will be multiple scores including the main one, held in the VALUE column.
GEO Series (GSE)
Lists of GSM files that together form a single experiment.
GEO Dataset (GDS)
These are curated files that hold a summarised combination of a GSE file and its GSM files. They contain normalised expression levels for each gene from each sample (i.e. only the VALUE field from the GSM file).
As long as you just need the expression level then a GDS file will suffice. If you need to dig deeper into how the expression levels were calculated, you'll need to get all the GSM files instead (which are listed in the GDS or GSE file).
To me, it was natural to ask: How can I turn a GEO DataSet (GDS file) into an R/BioConductor expression set object (exprSet)? And while we're at it, how to load the GEO Platform annotation (GPL file) as well? Both questions are answered below.
In the MOAC Module 5 assignment, the approach taken was to sanitize the data by hand, allowing it to be loaded into R with a simple call to the read.table command. It's a good idea to look at the raw files to understand what you are dealing with, but surely there is a more elegant way...
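To make the by-hand approach concrete, here is a minimal sketch in base R. It assumes the SOFT convention that header and marker lines start with ^, ! or #, and it parses a small hand-made excerpt (values copied from the GDS858 table shown later on this page) rather than the real file. The variable names are purely illustrative.

```r
# Hand-made miniature excerpt in the style of a GDS SOFT file.
# This is NOT the real GDS858 file, just enough lines to show the idea.
soft_lines <- c(
  "^DATASET = GDS858",
  "!dataset_platform = GPL96",
  "#ID_REF = Platform reference identifier",
  "!dataset_table_begin",
  "ID_REF\tIDENTIFIER\tGSM14498\tGSM14499",
  "1007_s_at\tU48705\t3736.9\t3811.0",
  "1053_at\tM87338\t343.0\t500.3",
  "117_at\tX51757\t120.9\t34.3",
  "!dataset_table_end"
)

# "Sanitize" the file: drop every line whose first character is ^, ! or #
is_header <- substr(soft_lines, 1, 1) %in% c("^", "!", "#")
table_lines <- soft_lines[!is_header]

# What remains is a plain tab-separated table that read.table understands
expr <- read.table(text = table_lines, header = TRUE, sep = "\t",
                   quote = "", comment.char = "", stringsAsFactors = FALSE)
print(expr)
```

For a real file you would read the lines with readLines("GDS858.soft") first. This throws away all the metadata in the header, which is one reason a dedicated parser is preferable.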
It turns out there are several existing GEO parsers, but one stands out above all others: Sean Davis' GEOquery (released roughly Dec 2005).
Installing GEOquery
Assuming you are running a recent version of BioConductor (1.8 or later) you should be able to install it from within R as follows:

    > source("http://www.bioconductor.org/biocLite.R")
    > biocLite("GEOquery")
    Running bioCLite version 0.1.6 with R version 2.3.1
    ...

For those of you on an older version of BioConductor, you will have to download and install it by hand from here.
If you are using Windows, download GEOquery_1.6.0.zip (or similar) and save it. Then from within the R program, use the menu option "Packages", "Install package(s) from local zip files..." and select the ZIP file.
On Linux, download GEOquery_1.6.0.tar.gz (or similar) and use sudo R CMD INSTALL GEOquery_1.6.0.tar.gz at the command prompt.
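A note for readers on current releases: the biocLite.R script has since been retired, and Bioconductor now recommends the BiocManager package from CRAN. A minimal sketch of the equivalent install, assuming a recent version of R and a working internet connection:

```r
# Install the BiocManager helper from CRAN if it is not already present
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install GEOquery (and its dependencies) from Bioconductor
BiocManager::install("GEOquery")
```

The version numbers and file names quoted above describe the 2006-era setup and will differ on a modern installation.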
Loading a GDS file with GEOquery
Here is a quick introduction to how to load a GDS file, and turn it into an expression set object:

    library(Biobase)
    library(GEOquery)
    #Download GDS file, put it in the current directory, and load it:
    gds858 <- getGEO('GDS858', destdir=".")
    #Or, open an existing GDS file (even if it's compressed):
    gds858 <- getGEO(filename='GDS858.soft.gz')

I'm using GDS858 as input. The SOFT file is available in compressed form here GDS858.soft.gz, but GEOquery takes care of finding this file for you and unzipping it automatically.
Loading this file from the hard disk drive takes about two minutes on my laptop.
There are two main things the GDS object gives us, meta data (from the file header) and a table of expression data. These are extracted using the Meta and Table functions. First let's have a look at the metadata:

    > Meta(gds858)$channel_count
    [1] "1"
    > Meta(gds858)$description
    [1] "Comparison of lung epithelial Calu-3 cells infected ..."
    > Meta(gds858)$feature_count
    [1] "22283"
    > Meta(gds858)$platform
    [1] "GPL96"
    > Meta(gds858)$sample_count
    [1] "19"
    > Meta(gds858)$sample_organism
    [1] "Homo sapiens"
    > Meta(gds858)$sample_type
    [1] "cDNA"
    > Meta(gds858)$title
    [1] "Mucoid and motile Pseudomonas aeruginosa infected lung epithelial cell comparison"
    > Meta(gds858)$type
    [1] "gene expression array-based"

Useful stuff, and now the expression data table:

    > colnames(Table(gds858))
     [1] "ID_REF"     "IDENTIFIER" "GSM14498"   "GSM14499"   "GSM14500"
     [6] "GSM14501"   "GSM14513"   "GSM14514"   "GSM14515"   "GSM14516"
    [11] "GSM14506"   "GSM14507"   "GSM14508"   "GSM14502"   "GSM14503"
    [16] "GSM14504"   "GSM14505"   "GSM14509"   "GSM14510"   "GSM14511"
    [21] "GSM14512"
    > Table(gds858)[1:10,1:6]
          ID_REF IDENTIFIER GSM14498 GSM14499 GSM14500 GSM14501
    1  1007_s_at     U48705   3736.9   3811.0   3699.6   3897.6
    2    1053_at     M87338    343.0    500.3    288.3    341.3
    3     117_at     X51757    120.9     34.3    145.8    110.5
    4     121_at     X69699   1523.8   1281.1   1281.9   1493.4
    5  1255_g_at     L36861     51.6     15.9     45.9      8.1
    6    1294_at     L13852    253.2    164.8    200.0    205.2
    7    1316_at     X55005    199.6    250.7    290.3    218.6
    8    1320_at     X79510     81.7     13.4     13.9     88.7
    9  1405_i_at     M21121     18.9      5.6     11.0      9.5
    10   1431_at     J02843     99.7     74.5     72.6    114.8

Now, let's turn this GDS object into an expression set object (using base 2 logarithms) and have a look at it:

    > eset <- GDS2eSet(gds858, do.log2=TRUE)
    > eset
    Expression Set (exprSet) with
            22283 genes
            19 samples
                     phenoData object with 4 variables and 19 cases
             varLabels
                    sample:
                    infection:
                    genotype/variation:
                    description:
    > geneNames(eset)[1:10]
     [1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at"
     [6] "1294_at"   "1316_at"   "1320_at"   "1405_i_at" "1431_at"
    > sampleNames(eset)
     [1] "GSM14498" "GSM14499" "GSM14500" "GSM14501" "GSM14513"
     [6] "GSM14514" "GSM14515" "GSM14516" "GSM14506" "GSM14507"
    [11] "GSM14508" "GSM14502" "GSM14503" "GSM14504" "GSM14505"
    [16] "GSM14509" "GSM14510" "GSM14511" "GSM14512"

GEOquery does an excellent job of extracting the phenotype data, as you can see:

    > pData(eset)$infection
     [1] FRD1       FRD1       FRD1       FRD1       FRD440
     [6] FRD440     FRD440     FRD440     FRD875     FRD875
    [11] FRD875     FRD875     FRD1234    FRD1234    FRD1234
    [16] uninfected uninfected uninfected uninfected
    Levels: FRD1 FRD1234 FRD440 FRD875 uninfected
    > pData(eset)$"genotype/variation"
     [1] control                control
     [3] control                control
     [5] mucoid                 mucoid
     [7] mucoid                 mucoid
     [9] motile                 motile
    [11] motile                 motile
    [13] non-mucoid, non-motile non-mucoid, non-motile
    [15] non-mucoid, non-motile non-mucoid, non-motile
    [17] non-mucoid, non-motile non-mucoid, non-motile
    [19] non-mucoid, non-motile
    Levels: control motile mucoid non-mucoid, non-motile

As with any expression set object, it's easy to pull out a subset of the data:

    > eset["1320_at","GSM14504"]
    Expression Set (exprSet) with
            1 genes
            1 samples
                     phenoData object with 4 variables and 1 cases
             varLabels
                    sample:
                    infection:
                    genotype/variation:
                    description:
    > exprs(eset["1320_at","GSM14504"])
            GSM14504
    1320_at  6.70044

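Because do.log2=TRUE was used, the value above is on a base 2 logarithmic scale. As a quick sanity check (a sketch, assuming the transformation is a plain log2 of the VALUE column), it can be converted back in base R:

```r
# Expression value reported above for probe 1320_at in sample GSM14504 (log2 scale)
log2_value <- 6.70044

# Undo the base 2 logarithm to recover the approximate raw intensity
raw_value <- 2^log2_value
print(round(raw_value, 1))

# The forward direction, applied to the first four raw 1320_at values
# from the Table(gds858) output shown earlier
raw_1320 <- c(GSM14498 = 81.7, GSM14499 = 13.4, GSM14500 = 13.9, GSM14501 = 88.7)
print(log2(raw_1320))
```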
You should be able to produce a heatmap of differentially expressed genes easily enough using this page, especially as the phenotype/sub-sample information has been sorted out for you.
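As a rough illustration of the idea, here is a base R sketch. It uses a simulated matrix as a stand-in for exprs(eset) and a hypothetical two-group design, and it ranks probes by the difference in group means, which is only a crude substitute for a proper differential expression test.

```r
set.seed(1)

# Simulated log2 expression values standing in for exprs(eset).
# These are NOT real GDS858 data; probe and sample names are made up.
expr <- matrix(rnorm(50 * 8, mean = 8, sd = 1), nrow = 50,
               dimnames = list(paste0("probe", 1:50), paste0("sample", 1:8)))

# Hypothetical design: four infected and four uninfected samples
group <- factor(rep(c("infected", "uninfected"), each = 4))

# Shift the first five probes in the infected group so there is something to find
expr[1:5, group == "infected"] <- expr[1:5, group == "infected"] + 3

# Rank probes by the absolute difference between the two group means
diff_means <- rowMeans(expr[, group == "infected"]) -
  rowMeans(expr[, group == "uninfected"])
top <- expr[order(abs(diff_means), decreasing = TRUE)[1:10], ]

# Draw the heatmap, with a colour bar marking the group of each sample
heatmap(top, scale = "row",
        ColSideColors = c("firebrick", "steelblue")[as.integer(group)])
```

With the real data, pData(eset)$infection would play the role of the group factor.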
Loading a GPL (Annotation) file with GEOquery
In addition to loading a GDS file to get the expression levels, you can also load the associated platform annotation file. You can find this out from the GDS858 meta information:

    > Meta(gds858)$platform
    [1] "GPL96"

So, for GDS858, the platform is GPL96, Affymetrix GeneChip Human Genome U133 Array Set HG-U133A.
Now let's load up the GPL file and have a look at it (it's a big file, about 12 MB, so this takes a while!):

    library(Biobase)
    library(GEOquery)
    #Download GPL file, put it in the current directory, and load it:
    gpl96 <- getGEO('GPL96', destdir=".")
    #Or, open an existing GPL file:
    gpl96 <- getGEO(filename='GPL96.soft')

As with the GDS object, we can use the Meta and Table functions to extract information:

    > Meta(gpl96)$title
    [1] "Affymetrix GeneChip Human Genome U133 Array Set HG-U133A"
    > colnames(Table(gpl96))
     [1] "ID"                               "Species.Scientific.Name"
     [3] "Annotation.Date"                  "GB_LIST"
     [5] "SPOT_ID"                          "Sequence.Source"
     [7] "Representative.Public.ID"         "Gene.Title"
     [9] "Gene.Symbol"                      "Entrez.Gene"
    [11] "RefSeq.Transcript.ID"             "Gene.Ontology.Biological.Process"
    [13] "Gene.Ontology.Cellular.Component" "Gene.Ontology.Molecular.Function"

Let's look at the first four columns, for the first ten genes:

    > Table(gpl96)[1:10,1:4]
              ID Species.Scientific.Name Annotation.Date GB_LIST
    1  1007_s_at            Homo sapiens       16-Sep-05  U48705
    2    1053_at            Homo sapiens       16-Sep-05  M87338
    3     117_at            Homo sapiens       16-Sep-05  X51757
    4     121_at            Homo sapiens       16-Sep-05  X69699
    5  1255_g_at            Homo sapiens       16-Sep-05  L36861
    6    1294_at            Homo sapiens       16-Sep-05  L13852
    7    1316_at            Homo sapiens       16-Sep-05  X55005
    8    1320_at            Homo sapiens       16-Sep-05  X79510
    9  1405_i_at            Homo sapiens       16-Sep-05  M21121
    10   1431_at            Homo sapiens       16-Sep-05  J02843

This shows a hand-picked selection of the columns, again for the first ten genes:

    > Table(gpl96)[1:10,c("ID","GB_LIST","Gene.Title","Gene.Symbol","Entrez.Gene")]
              ID GB_LIST                                            Gene.Title Gene.Symbol Entrez.Gene
    1  1007_s_at  U48705            discoidin domain receptor family, member 1        DDR1         780
    2    1053_at  M87338           replication factor C (activator 1) 2, 40kDa        RFC2        5982
    3     117_at  X51757                  heat shock 70kDa protein 6 (HSP70B')       HSPA6        3310
    4     121_at  X69699                                     paired box gene 8        PAX8        7849
    5  1255_g_at  L36861               guanylate cyclase activator 1A (retina)      GUCA1A        2978
    6    1294_at  L13852                   ubiquitin-activating enzyme E1-like       UBE1L        7318
    7    1316_at  X55005   thyroid hormone receptor, alpha (erythroblastic...)        THRA        7067
    8    1320_at  X79510    protein tyrosine phosphatase, non-receptor type 21      PTPN21       11099
    9  1405_i_at  M21121                        chemokine (C-C motif) ligand 5        CCL5        6352
    10   1431_at  J02843 cytochrome P450, family 2, subfamily E, polypeptide 1      CYP2E1        1571

The above all used the 12MB file GPL96.soft, but you can also get a much smaller 3MB file GPL96.annot (compressed as GPL96.annot.gz) which has slightly different information in it... see here.
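Since the ID column of the platform table matches the ID_REF column of the expression table, the two can be joined. The following is a minimal sketch in base R using small hand-built data frames (values copied from the outputs shown on this page) in place of the full Table(gds858) and Table(gpl96) results; the variable names are illustrative.

```r
# Stand-in for a few rows and one sample column of Table(gds858)
expr <- data.frame(
  ID_REF   = c("1007_s_at", "1053_at", "117_at", "121_at"),
  GSM14498 = c(3736.9, 343.0, 120.9, 1523.8),
  stringsAsFactors = FALSE
)

# Stand-in for the matching rows and selected columns of Table(gpl96)
annot <- data.frame(
  ID          = c("1007_s_at", "1053_at", "117_at", "121_at"),
  Gene.Symbol = c("DDR1", "RFC2", "HSPA6", "PAX8"),
  Entrez.Gene = c(780, 5982, 3310, 7849),
  stringsAsFactors = FALSE
)

# Join on the probe identifier, which has a different column name in each table
merged <- merge(expr, annot, by.x = "ID_REF", by.y = "ID")
print(merged)
```

With the real objects the same call would presumably look like merge(Table(gds858), Table(gpl96)[, c("ID", "Gene.Symbol", "Entrez.Gene")], by.x = "ID_REF", by.y = "ID").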
Using the BioConductor hgu133a package
Instead of loading the GEO annotation file for GPL96/HG-U133A, we could use an existing annotation package from the BioConductor annotation sets, hgu133a. These libraries exist for most of the popular microarray gene chips.
First of all, we need to install the package:

    > source("http://www.bioconductor.org/biocLite.R")
    > biocLite("hgu133a")
    Running bioCLite version 0.1 with R version 2.1.1
    ...

Then we can load the newly installed library:

    > library(hgu133a)

There is an easy way to check when this was last updated, and what it can translate the Affy probe names into:

    > hgu133a()

    Quality control information for hgu133a
    Date built: Created: Tue May 17 13:02:12 2005

    Number of probes: 22277
    Probe number missmatch: None
    Probe missmatch: None
    Mappings found for probe based rda files:
             hgu133aACCNUM found 22277 of 22277
             hgu133aCHRLOC found 20195 of 22277
             hgu133aCHR found 21283 of 22277
             hgu133aENZYME found 2507 of 22277
             hgu133aGENENAME found 18726 of 22277
             hgu133aGO found 18647 of 22277
             hgu133aLOCUSID found 21747 of 22277
             hgu133aMAP found 21183 of 22277
             hgu133aOMIM found 15109 of 22277
             hgu133aPATH found 5067 of 22277
             hgu133aPMID found 21004 of 22277
             hgu133aREFSEQ found 21002 of 22277
             hgu133aSUMFUNC found 0 of 22277
             hgu133aSYMBOL found 21303 of 22277
             hgu133aUNIGENE found 21128 of 22277
    Mappings found for non-probe based rda files:
             hgu133aCHRLENGTHS found 25
             hgu133aENZYME2PROBE found 663
             hgu133aGO2ALLPROBES found 5912
             hgu133aGO2PROBE found 4326
             hgu133aORGANISM found 1
             hgu133aPATH2PROBE found 142
             hgu133aPMID2PROBE found 96291

And now let's test some of those mappings on the fourth gene 121_at in the GPL file:

    > Table(gpl96)[4,c("ID","GB_LIST","Gene.Title","Gene.Symbol","Entrez.Gene")]
          ID GB_LIST        Gene.Title Gene.Symbol Entrez.Gene
    4 121_at  X69699 paired box gene 8        PAX8        7849

Now, what does the annotation file have to say?

    > mget("121_at",hgu133aACCNUM)
    $"121_at"
    [1] "X69699"
    > mget("121_at",hgu133aGENENAME)
    $"121_at"
    [1] "paired box gene 8"
    > mget("121_at",hgu133aSYMBOL)
    $"121_at"
    [1] "PAX8"
    > mget("121_at",hgu133aUNIGENE)
    $"121_at"
    [1] "Hs.469728"

You will notice that there is some overlap between the information in the GEO annotation table, and the hgu133a package (which compiles its data from a range of sources). See help(hgu133a).
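On current Bioconductor releases the annotation for this chip is distributed as the hgu133a.db package and is usually queried through AnnotationDbi. A hedged sketch of the equivalent lookup, assuming hgu133a.db has already been installed (the results may differ from the 2005 output above, since the annotation has been updated since then):

```r
library(AnnotationDbi)
library(hgu133a.db)  # assumed to be installed, e.g. via BiocManager::install("hgu133a.db")

# Look up a single annotation field for the probe set 121_at
symbol <- mapIds(hgu133a.db, keys = "121_at",
                 column = "SYMBOL", keytype = "PROBEID")
print(symbol)

# Or fetch several fields at once as a data frame
info <- select(hgu133a.db, keys = "121_at",
               columns = c("SYMBOL", "GENENAME", "ENTREZID"),
               keytype = "PROBEID")
print(info)
```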
You should also read this introduction, Bioconductor: Annotation Package Overview.
Source: https://warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/geo/