Can I Upload GZ Files From GEO Datasets to IPA Analysis

This page discusses how to load GEO SOFT format microarray data from the Gene Expression Omnibus database (GEO) (hosted by the NCBI) into R/BioConductor. SOFT stands for Simple Omnibus Format in Text. There are actually four types of GEO SOFT file available:

GEO Platform (GPL)
These files describe a particular type of microarray. They are annotation files.

GEO Sample (GSM)
Files that contain all the data from the use of a single chip. For each gene there will be multiple scores including the main one, held in the VALUE column.

GEO Series (GSE)
Lists of GSM files that together form a single experiment.

GEO Dataset (GDS)
These are curated files that hold a summarised combination of a GSE file and its GSM files. They contain normalised expression levels for each gene from each sample (i.e. only the VALUE field from the GSM file).

As long as you just need the expression level then a GDS file will suffice. If you need to dig deeper into how the expression levels were calculated, you'll need to get all the GSM files instead (which are listed in the GDS or GSE file).

To me, it was natural to ask: How can I turn a GEO DataSet (GDS file) into an R/BioConductor expression set object (exprSet)? (See "Loading a GDS file with GEOquery" below.) And while we're at it, how to load the GEO Platform annotation (GPL file) too? (See "Loading a GPL (Annotation) file with GEOquery" below.)

In the MOAC Module 5 assignment, the approach taken was to sanitize the data by hand, allowing it to be loaded into R with a simple call to the read.table command. It's a good idea to look at the raw files to understand what you are dealing with, but surely there is a more elegant way...
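For illustration, the by-hand approach can be sketched in base R as follows. The tiny SOFT fragment is made up for this example (its two data rows reuse values printed later on this page), and it assumes the GDS layout in which the data table sits between `!dataset_table_begin` and `!dataset_table_end` marker lines. The actual MOAC assignment may well have cleaned the file differently.

```r
# A made-up miniature of a GDS SOFT file (real files have thousands of rows).
soft_lines <- c(
  "^DATASET = GDS858",
  "!dataset_platform = GPL96",
  "!dataset_table_begin",
  "ID_REF\tIDENTIFIER\tGSM14498\tGSM14499",
  "1007_s_at\tU48705\t3736.9\t3811.0",
  "1053_at\tM87338\t343.0\t500.3",
  "!dataset_table_end"
)

# Locate the marker lines that bracket the data table.
begin <- which(soft_lines == "!dataset_table_begin")
end   <- which(soft_lines == "!dataset_table_end")

# Keep only the lines strictly between the markers and parse them as
# tab-separated text. Treating "null" as missing is an assumption about
# how GEO marks absent values.
table_lines <- soft_lines[(begin + 1):(end - 1)]
expr_table <- read.table(text = table_lines, header = TRUE, sep = "\t",
                         quote = "", comment.char = "",
                         na.strings = "null", stringsAsFactors = FALSE)

print(expr_table)
```

The header lines (those starting with `^`, `!` or `#`) are simply skipped here; a fuller parser would keep them as metadata.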

It turns out there are several existing GEO parsers, but one stands out above all others: Sean Davis' GEOquery (released roughly Dec 2005).

Installing GEOquery

Assuming you are running a recent version of BioConductor (1.8 or later) you should be able to install it from within R as follows:

          > source("http://www.bioconductor.org/biocLite.R")
          > biocLite("GEOquery")

          Running bioCLite version 0.1.6  with R version  2.3.1
          ...

For those of you on an older version of BioConductor, you will have to download and install it by hand.

If you are using Windows, download GEOquery_1.6.0.zip (or similar) and save it. Then from within the R program, use the menu option "Packages", "Install package(s) from local zip files..." and select the ZIP file.

On Linux, download GEOquery_1.6.0.tar.gz (or similar) and use sudo R CMD INSTALL GEOquery_1.6.0.tar.gz at the command prompt.
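Note that the biocLite.R script has since been retired. On current Bioconductor releases, packages are installed through the BiocManager package from CRAN. A minimal sketch follows; the helper name `install_bioc_if_missing` is made up for this example and is not part of any package.

```r
# Hypothetical helper: install a Bioconductor package only if it is absent.
install_bioc_if_missing <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # already installed, nothing to do
  }
  # BiocManager itself is distributed via CRAN.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(pkg)
  invisible(TRUE)
}

# Usage (needs a network connection, so it is left commented out here):
# install_bioc_if_missing("GEOquery")
```

Checking with `requireNamespace()` first avoids re-downloading a package that is already present.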

Loading a GDS file with GEOquery

Here is a quick introduction to how to load a GDS file, and turn it into an expression set object:

          library(Biobase)
          library(GEOquery)

          #Download GDS file, put it in the current directory, and load it:
          gds858 <- getGEO('GDS858', destdir=".")

          #Or, open an existing GDS file (even if it's compressed):
          gds858 <- getGEO(filename='GDS858.soft.gz')

I'm using GDS858 as input. The SOFT file is available in compressed form as GDS858.soft.gz, but GEOquery takes care of finding this file for you and unzipping it automatically.
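GEOquery handles the decompression itself, but the same idea is available in base R if you just want to peek at a compressed SOFT file: `gzfile()` opens a gzip-compressed file as an ordinary text connection. The sketch below first writes a small made-up file to a temporary location so that it runs without any download.

```r
# Write a small made-up SOFT header to a temporary .gz file.
tmp <- tempfile(fileext = ".soft.gz")
con <- gzfile(tmp, open = "wt")
writeLines(c("^DATASET = GDS858", "!dataset_platform = GPL96"), con)
close(con)

# Read it back: gzfile() decompresses transparently.
con <- gzfile(tmp, open = "rt")
header <- readLines(con, n = 2)
close(con)

print(header)
unlink(tmp)  # tidy up the temporary file
```

With the real download in the current directory, `readLines(gzfile("GDS858.soft.gz"), n = 20)` would show the first twenty lines of the header in the same way.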

Loading this file from the hard disk drive takes about two minutes on my laptop.

There are two main things the GDS object gives us, meta data (from the file header) and a table of expression data. These are extracted using the Meta and Table functions. First let's have a look at the metadata:

          > Meta(gds858)$channel_count
          [1] "1"
          > Meta(gds858)$description
          [1] "Comparison of lung epithelial Calu-3 cells infected ..."
          > Meta(gds858)$feature_count
          [1] "22283"
          > Meta(gds858)$platform
          [1] "GPL96"
          > Meta(gds858)$sample_count
          [1] "19"
          > Meta(gds858)$sample_organism
          [1] "Homo sapiens"
          > Meta(gds858)$sample_type
          [1] "cDNA"
          > Meta(gds858)$title
          [1] "Mucoid and motile Pseudomonas aeruginosa infected lung epithelial cell comparison"
          > Meta(gds858)$type
          [1] "gene expression array-based"

Useful stuff, and now the expression data table:

          > colnames(Table(gds858))
           [1] "ID_REF"     "IDENTIFIER" "GSM14498"   "GSM14499"   "GSM14500"
           [6] "GSM14501"   "GSM14513"   "GSM14514"   "GSM14515"   "GSM14516"
          [11] "GSM14506"   "GSM14507"   "GSM14508"   "GSM14502"   "GSM14503"
          [16] "GSM14504"   "GSM14505"   "GSM14509"   "GSM14510"   "GSM14511"
          [21] "GSM14512"

          > Table(gds858)[1:10,1:6]
                ID_REF IDENTIFIER GSM14498 GSM14499 GSM14500 GSM14501
          1  1007_s_at     U48705   3736.9   3811.0   3699.6   3897.6
          2    1053_at     M87338    343.0    500.3    288.3    341.3
          3     117_at     X51757    120.9     34.3    145.8    110.5
          4     121_at     X69699   1523.8   1281.1   1281.9   1493.4
          5  1255_g_at     L36861     51.6     15.9     45.9      8.1
          6    1294_at     L13852    253.2    164.8    200.0    205.2
          7    1316_at     X55005    199.6    250.7    290.3    218.6
          8    1320_at     X79510     81.7     13.4     13.9     88.7
          9  1405_i_at     M21121     18.9      5.6     11.0      9.5
          10   1431_at     J02843     99.7     74.5     72.6    114.8

Now, let's turn this GDS object into an expression set object (using base 2 logarithms) and have a look at it:

          > eset <- GDS2eSet(gds858, do.log2=TRUE)
          > eset
          Expression Set (exprSet) with
                  22283 genes
                  19 samples
                           phenoData object with 4 variables and 19 cases
                   varLabels
                          : sample
                          : infection
                          : genotype/variation
                          : description

          > geneNames(eset)[1:10]
           [1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at"
           [6] "1294_at"   "1316_at"   "1320_at"   "1405_i_at" "1431_at"

          > sampleNames(eset)
           [1] "GSM14498" "GSM14499" "GSM14500" "GSM14501" "GSM14513"
           [6] "GSM14514" "GSM14515" "GSM14516" "GSM14506" "GSM14507"
          [11] "GSM14508" "GSM14502" "GSM14503" "GSM14504" "GSM14505"
          [16] "GSM14509" "GSM14510" "GSM14511" "GSM14512"
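The do.log2=TRUE argument asks for the expression values to be stored as base 2 logarithms. In base R terms the transformation amounts to the following, shown on a few raw values from the expression table printed earlier. This is a sketch of the arithmetic only, not GEOquery's internal code.

```r
# Raw expression values for three probes in two samples (from Table(gds858)).
raw <- matrix(c(3736.9, 343.0, 120.9,
                3811.0, 500.3,  34.3),
              nrow = 3,
              dimnames = list(c("1007_s_at", "1053_at", "117_at"),
                              c("GSM14498", "GSM14499")))

# Base 2 logarithm of every entry; row and column names are preserved.
log_expr <- log2(raw)
print(round(log_expr, 3))

# The transformation is reversible: 2^x recovers the raw values.
stopifnot(isTRUE(all.equal(2^log_expr, raw)))
```

On recent Bioconductor releases the object returned is, as far as I know, an ExpressionSet rather than an exprSet, and featureNames() is used where this page uses geneNames().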

GEOquery does an excellent job of extracting the phenotype data, as you can see:

          > pData(eset)$infection
           [1] FRD1       FRD1       FRD1       FRD1       FRD440
           [6] FRD440     FRD440     FRD440     FRD875     FRD875
          [11] FRD875     FRD875     FRD1234    FRD1234    FRD1234
          [16] uninfected uninfected uninfected uninfected
          Levels: FRD1 FRD1234 FRD440 FRD875 uninfected

          > pData(eset)$"genotype/variation"
           [1] control                control
           [3] control                control
           [5] mucoid                 mucoid
           [7] mucoid                 mucoid
           [9] motile                 motile
          [11] motile                 motile
          [13] non-mucoid, non-motile non-mucoid, non-motile
          [15] non-mucoid, non-motile non-mucoid, non-motile
          [17] non-mucoid, non-motile non-mucoid, non-motile
          [19] non-mucoid, non-motile
          Levels: control motile mucoid non-mucoid, non-motile
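Because these columns are ordinary R factors, they can be used directly for grouping and subsetting. The sketch below rebuilds the two factors by hand from the output above so that it runs without downloading anything; with the real object you would use pData(eset)$infection and pData(eset)$"genotype/variation" instead.

```r
# Phenotype columns as printed by pData(eset), rebuilt by hand.
infection <- factor(rep(c("FRD1", "FRD440", "FRD875", "FRD1234", "uninfected"),
                        times = c(4, 4, 4, 3, 4)))
genotype <- factor(rep(c("control", "mucoid", "motile",
                         "non-mucoid, non-motile"),
                       times = c(4, 4, 4, 7)))

# Number of samples per infection group.
print(table(infection))

# Cross-tabulate the two annotations to see how they line up.
print(table(infection, genotype))

# Positions of one group, e.g. the uninfected samples; the same logical
# test could be used to pick columns out of the expression matrix.
is_uninfected <- infection == "uninfected"
print(which(is_uninfected))
```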

As with any expression set object, it's easy to pull out a subset of the data:

          > eset["1320_at","GSM14504"]
          Expression Set (exprSet) with
                  1 genes
                  1 samples
                           phenoData object with 4 variables and 1 cases
                   varLabels
                          : sample
                          : infection
                          : genotype/variation
                          : description

          > exprs(eset["1320_at","GSM14504"])
                  GSM14504
          1320_at  6.70044

You should be able to produce a heatmap of differentially expressed genes easily enough, especially as the phenotype/sub-sample information has been sorted out for you.
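As a rough idea of what that involves, here is a base R sketch. It uses the ten probes and four samples printed earlier rather than the full data set, and it ranks probes by the variance of their log2 values purely as a stand-in; that is not a proper test for differential expression.

```r
# Ten probes by four samples, copied from the Table(gds858) output.
raw <- matrix(c(3736.9, 3811.0, 3699.6, 3897.6,
                 343.0,  500.3,  288.3,  341.3,
                 120.9,   34.3,  145.8,  110.5,
                1523.8, 1281.1, 1281.9, 1493.4,
                  51.6,   15.9,   45.9,    8.1,
                 253.2,  164.8,  200.0,  205.2,
                 199.6,  250.7,  290.3,  218.6,
                  81.7,   13.4,   13.9,   88.7,
                  18.9,    5.6,   11.0,    9.5,
                  99.7,   74.5,   72.6,  114.8),
              ncol = 4, byrow = TRUE,
              dimnames = list(c("1007_s_at", "1053_at", "117_at", "121_at",
                                "1255_g_at", "1294_at", "1316_at", "1320_at",
                                "1405_i_at", "1431_at"),
                              c("GSM14498", "GSM14499", "GSM14500", "GSM14501")))
log_expr <- log2(raw)

# Rank probes by variance across samples and keep the five most variable.
probe_var <- apply(log_expr, 1, var)
top <- names(sort(probe_var, decreasing = TRUE))[1:5]

# Draw the heatmap; each row is scaled so that patterns, not absolute
# levels, stand out.
heatmap(log_expr[top, ], scale = "row", margins = c(8, 8))
```

With the real data you would start from exprs(eset) and could label or order the columns using the phenotype factors.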

Loading a GPL (Annotation) file with GEOquery

In addition to loading a GDS file to get the expression levels, you can also load the associated platform annotation file. You can find out which platform was used from the GDS858 meta information:

          > Meta(gds858)$platform
          [1] "GPL96"

So, for GDS858, the platform is GPL96, Affymetrix GeneChip Human Genome U133 Array Set HG-U133A.

Now let's load up the GPL file and have a look at it (it's a big file, about 12 MB, so this takes a while!):

          library(Biobase)
          library(GEOquery)

          #Download GPL file, put it in the current directory, and load it:
          gpl96 <- getGEO('GPL96', destdir=".")

          #Or, open an existing GPL file:
          gpl96 <- getGEO(filename='GPL96.soft')

As with the GDS object, we can use the Meta and Table functions to extract information:

          > Meta(gpl96)$title
          [1] "Affymetrix GeneChip Human Genome U133 Array Set HG-U133A"

          > colnames(Table(gpl96))
           [1] "ID"                               "Species.Scientific.Name"
           [3] "Annotation.Date"                  "GB_LIST"
           [5] "SPOT_ID"                          "Sequence.Source"
           [7] "Representative.Public.ID"         "Gene.Title"
           [9] "Gene.Symbol"                      "Entrez.Gene"
          [11] "RefSeq.Transcript.ID"             "Gene.Ontology.Biological.Process"
          [13] "Gene.Ontology.Cellular.Component" "Gene.Ontology.Molecular.Function"

Let's look at the first four columns, for the first ten genes:

          > Table(gpl96)[1:10,1:4]
                    ID Species.Scientific.Name Annotation.Date GB_LIST
          1  1007_s_at            Homo sapiens       16-Sep-05  U48705
          2    1053_at            Homo sapiens       16-Sep-05  M87338
          3     117_at            Homo sapiens       16-Sep-05  X51757
          4     121_at            Homo sapiens       16-Sep-05  X69699
          5  1255_g_at            Homo sapiens       16-Sep-05  L36861
          6    1294_at            Homo sapiens       16-Sep-05  L13852
          7    1316_at            Homo sapiens       16-Sep-05  X55005
          8    1320_at            Homo sapiens       16-Sep-05  X79510
          9  1405_i_at            Homo sapiens       16-Sep-05  M21121
          10   1431_at            Homo sapiens       16-Sep-05  J02843

This shows a hand picked selection of the columns, again for the first ten genes:

          > Table(gpl96)[1:10,c("ID","GB_LIST","Gene.Title","Gene.Symbol","Entrez.Gene")]
                    ID GB_LIST                                            Gene.Title Gene.Symbol Entrez.Gene
          1  1007_s_at  U48705            discoidin domain receptor family, member 1        DDR1         780
          2    1053_at  M87338           replication factor C (activator 1) 2, 40kDa        RFC2        5982
          3     117_at  X51757                  heat shock 70kDa protein 6 (HSP70B')       HSPA6        3310
          4     121_at  X69699                                     paired box gene 8        PAX8        7849
          5  1255_g_at  L36861               guanylate cyclase activator 1A (retina)      GUCA1A        2978
          6    1294_at  L13852                   ubiquitin-activating enzyme E1-like       UBE1L        7318
          7    1316_at  X55005   thyroid hormone receptor, alpha (erythroblastic...)        THRA        7067
          8    1320_at  X79510    protein tyrosine phosphatase, non-receptor type 21      PTPN21       11099
          9  1405_i_at  M21121                        chemokine (C-C motif) ligand 5        CCL5        6352
          10   1431_at  J02843 cytochrome P450, family 2, subfamily E, polypeptide 1      CYP2E1        1571
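The ID column of the platform table holds the same probe identifiers as the ID_REF column of the GDS table, so the two can be joined to label expression values with gene symbols. A minimal sketch, using small hand-built data frames in place of Table(gds858) and Table(gpl96) (the values are copied from the output on this page, and the annotation rows are deliberately shuffled):

```r
# Stand-ins for Table(gds858) and Table(gpl96), cut down to three probes.
expr <- data.frame(ID_REF   = c("1007_s_at", "1053_at", "117_at"),
                   GSM14498 = c(3736.9, 343.0, 120.9),
                   stringsAsFactors = FALSE)
annot <- data.frame(ID          = c("117_at", "1007_s_at", "1053_at"),
                    Gene.Symbol = c("HSPA6", "DDR1", "RFC2"),
                    Entrez.Gene = c(3310, 780, 5982),
                    stringsAsFactors = FALSE)

# match() finds, for each expression row, the matching annotation row,
# so the result keeps the row order of the expression table.
idx <- match(expr$ID_REF, annot$ID)
annotated <- cbind(expr, annot[idx, c("Gene.Symbol", "Entrez.Gene")])
rownames(annotated) <- NULL

print(annotated)
```

Using match() rather than merge() is a design choice here: merge() would also work, but it sorts the result by the key column, whereas match() leaves the expression table's order untouched.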

The above all used the 12MB file GPL96.soft, but you can also get a much smaller 3MB file GPL96.annot (compressed as GPL96.annot.gz) which has slightly different information in it.

Using the BioConductor hgu133a package

Instead of loading the GEO annotation file for GPL96/HG-U133A, we could use an existing annotation package from the BioConductor annotation sets, hgu133a. These libraries exist for most of the popular microarray gene chips.

First of all, we need to install the package:

          > source("http://www.bioconductor.org/biocLite.R")
          > biocLite("hgu133a")

          Running bioCLite version 0.1  with R version  2.1.1
          ...

Then we can load the newly installed library:

          > library(hgu133a)        

There is an easy way to check when this was last updated, and what it can translate the Affy probe names into:

          > hgu133a()

          Quality control information for  hgu133a
          Date built: Created: Tue May 17 13:02:12 2005

          Number of probes: 22277
          Probe number missmatch: None
          Probe missmatch: None
          Mappings found for probe based rda files:
                   hgu133aACCNUM found 22277 of 22277
                   hgu133aCHRLOC found 20195 of 22277
                   hgu133aCHR found 21283 of 22277
                   hgu133aENZYME found 2507 of 22277
                   hgu133aGENENAME found 18726 of 22277
                   hgu133aGO found 18647 of 22277
                   hgu133aLOCUSID found 21747 of 22277
                   hgu133aMAP found 21183 of 22277
                   hgu133aOMIM found 15109 of 22277
                   hgu133aPATH found 5067 of 22277
                   hgu133aPMID found 21004 of 22277
                   hgu133aREFSEQ found 21002 of 22277
                   hgu133aSUMFUNC found 0 of 22277
                   hgu133aSYMBOL found 21303 of 22277
                   hgu133aUNIGENE found 21128 of 22277
          Mappings found for non-probe based rda files:
                   hgu133aCHRLENGTHS found 25
                   hgu133aENZYME2PROBE found 663
                   hgu133aGO2ALLPROBES found 5912
                   hgu133aGO2PROBE found 4326
                   hgu133aORGANISM found 1
                   hgu133aPATH2PROBE found 142
                   hgu133aPMID2PROBE found 96291

And now let's test some of those mappings on the fourth gene 121_at in the GPL file:

          > Table(gpl96)[4,c("ID","GB_LIST","Gene.Title","Gene.Symbol","Entrez.Gene")]
                ID   GB_LIST          Gene.Title   Gene.Symbol   Entrez.Gene
          4 121_at    X69699   paired box gene 8          PAX8          7849

Now, what does the annotation package have to say?

          > mget("121_at",hgu133aACCNUM)
          $"121_at"
          [1] "X69699"

          > mget("121_at",hgu133aGENENAME)
          $"121_at"
          [1] "paired box gene 8"

          > mget("121_at",hgu133aSYMBOL)
          $"121_at"
          [1] "PAX8"

          > mget("121_at",hgu133aUNIGENE)
          $"121_at"
          [1] "Hs.469728"
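In annotation packages of this era, each mapping such as hgu133aSYMBOL behaved, as far as I know, like an R environment keyed by probe ID, which is why the base function mget() works on it. The toy environment below is made up for this example (it holds two values shown on this page) and mimics that lookup without needing the package installed.

```r
# A made-up stand-in for hgu133aSYMBOL: an environment keyed by probe ID.
toy_symbol <- new.env()
assign("121_at",  "PAX8",   envir = toy_symbol)
assign("1320_at", "PTPN21", envir = toy_symbol)

# mget() looks up several keys at once and returns a named list.
hits <- mget(c("121_at", "1320_at"), envir = toy_symbol)
print(hits)

# Unknown probes raise an error unless ifnotfound supplies a default.
safe <- mget(c("121_at", "no_such_probe"), envir = toy_symbol,
             ifnotfound = NA)
print(unlist(safe))
```

On current Bioconductor releases the corresponding package is hgu133a.db, and lookups usually go through the AnnotationDbi interface rather than plain environments.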

You will notice that there is some overlap between the information in the GEO annotation table, and the hgu133a package (which compiles its data from a range of sources). See help(hgu133a).

You should also read this introduction, Bioconductor: Annotation Package Overview [PDF].

Source: https://warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/geo/