Tunis Workshop 2012
Finding genes and the gene page
BLAST exercises
BLASTp and BLASTn
A research group identified a gene from patients with disturbed sleeping patterns:
Nucleotide sequence:
gggtgaacag ccgcacggga gtaggtacgc acctgacctc gctggcactg
ccgggcaagg cagagggtgt ggcgtcgctc accagccagt gcagctacag
cagcaccatc gtccatgtgg gagacaagaa gccgcagccg gagttagaga
tggtggaaga tgctgcgagt gggccagaat
Translation:
VNSRTGVGTHLTSLALPGKAEGVASLTSQCSYSSTIVHVGDKKPQPELEMVEDAASGPE
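Note that the translation above does not start at the first base of the nucleotide sequence. As a quick check, the sketch below translates the three forward reading frames with the standard genetic code and reports which frame reproduces the given peptide. The function names (translate, matching_frame) are illustrative only and are not part of any BLAST tool.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Sequences copied from the exercise (spaces and line breaks removed).
NUCLEOTIDE = (
    "gggtgaacagccgcacgggagtaggtacgcacctgacctcgctggcactg"
    "ccgggcaaggcagagggtgtggcgtcgctcaccagccagtgcagctacag"
    "cagcaccatcgtccatgtgggagacaagaagccgcagccggagttagaga"
    "tggtggaagatgctgcgagtgggccagaat"
).upper()
PROTEIN = "VNSRTGVGTHLTSLALPGKAEGVASLTSQCSYSSTIVHVGDKKPQPELEMVEDAASGPE"


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def matching_frame(dna, protein):
    """Return the forward frame (1, 2 or 3) whose translation equals protein."""
    for offset in range(3):
        if translate(dna[offset:]) == protein:
            return offset + 1
    return None


print("Peptide is encoded in forward frame", matching_frame(NUCLEOTIDE, PROTEIN))
```

Knowing the frame explains why BLASTP with the peptide and a translated search with the nucleotide sequence can point to the same protein.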
1. Run BLASTN (both the "blastn" and "megablast" programs) and BLASTP searches at NCBI BLAST. Use the "Nucleotide Collection" and "Non-redundant protein" databases, respectively. Do both searches return the same top hit? Do the E-values differ? Why?
2. What is the difference between the BLASTN and MEGABLAST results? To see this, focus on the distant homologs found by each search (you can get the taxonomy/organism/lineage report by clicking the "Taxonomy Report" link at the top of the results page). Can you find a homolog in, for example, Xenopus in both cases? If not, why?
3. Which gene might this be?
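When thinking about why E-values differ between searches, it can help to see how a score is converted into an E-value. The sketch below assumes the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total length of the database. NCBI BLAST uses corrected "effective" lengths, so values on the results page will not match this exactly. The bit score and database sizes below are made-up illustration values.

```python
def expect_value(bit_score, query_len, db_len):
    """E = m * n * 2**(-S') using raw lengths (no effective-length correction)."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers: the same bit score in a protein and a nucleotide search.
protein_e = expect_value(bit_score=100.0, query_len=59, db_len=1e9)
nucleotide_e = expect_value(bit_score=100.0, query_len=180, db_len=1e11)

print(f"protein search:    E = {protein_e:.3g}")
print(f"nucleotide search: E = {nucleotide_e:.3g}")
```

Under this relation the E-value grows in proportion to the search space and halves with every additional bit of score.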
Frog annotation
The frog genome is not yet fully annotated. Many proteins are predicted but the annotation of their function is far from being complete.
X. laevis and X. tropicalis
1. Predict the function of the following frog sequence based on its homologs (using NCBI BLAST).
>Frog_protein
RAPNHKANDISGSCMRTTSSQNSNQAPSISNQQHQAPSLPPSSPSHGNSSAQKKSKSSNS
SGSSQATLSKQIFPWMKESRQNAKQKNTSSSSSTPPGENCEEKSPT
2. Observe how E-values change with database size.
3. Turn the low-complexity filter on (see "Algorithm parameters"). Look at the alignments: what difference does this filter make?
4. Find the most similar nucleotide sequence for this protein sequence in the nr database. (Look carefully at the list of available BLAST programs and choose one that can do this.) Which frog species (Xenopus tropicalis or Xenopus laevis) does this sequence originate from?
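The low-complexity filter in step 3 masks compositionally biased stretches, such as the serine-rich runs in the frog protein. The sketch below is a simplified stand-in and not the SEG algorithm that NCBI BLAST applies to protein queries: it computes Shannon entropy in a sliding window and replaces every low-entropy window with X. The window size (12) and threshold (2.2 bits) are assumptions chosen for illustration.

```python
import math
from collections import Counter

FROG_PROTEIN = (
    "RAPNHKANDISGSCMRTTSSQNSNQAPSISNQQHQAPSLPPSSPSHGNSSAQKKSKSSNS"
    "SGSSQATLSKQIFPWMKESRQNAKQKNTSSSSSTPPGENCEEKSPT"
)


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace residues in any window with entropy below threshold by 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


print(FROG_PROTEIN)
print(mask_low_complexity(FROG_PROTEIN))
```

Masked residues cannot seed or extend alignments, which is why hits that rest only on biased composition tend to disappear when the filter is on.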
Parasite genome database
EuPathDB
DNA sequence
>Unknown_gene
GTATTTTTAATTCCTTATAAGTGTAAATACACTTTTACAATAATGGAAAATTTGGGGAAA
ATGAGTAATCCAAATCAAAATCTTGGAGCTCAACATCTTAAAGTGAGCGGAGAAAATCAG
CATATTGAGCATGATTATGATGAGCCAACCATTCCAGCAGATTGTTGGAATGAATCACAA
GAGGATTACAAAGTTGGAGGTTACCACCCAGTATCGATTGGGGAGGTTTATAATGGTAGA
TATTTGATTGTTTCGAAGTTAGGATGGGGCCATTTTTCAACAGTATGGTTGGCTATTGAC
ACTTTAAGCACACCTACTACTTATTTTGCTTTAAAGTTCCAAAAAGGAGCCCAAGAGTAC
CGACAGGCTGCATATGACGAAATGGAGATATTAACGGCTACTAAGAACCATGCTAGTGGA
GAGGAGTGGAGGGAATCTTTGAATAGACATTTGGAAAGTTGTATTGAGAATTTTACTCGT
CCATTTTCAAGGAACTTTAATGGTGTTGTAGGTTTTATTGATTACTTTGAGGTTTCAGGT
CCAAATGGCCAACATGTTTGTATGGTATTTGAAGTATTGGGGCCGAATATCTTGCAATTA
ATCAGCTTGTATGACTATAAAGGAGTTCCAATTGATATAGTCAGGAAGATTGCTGCCCAT
TCGTTAATTGGATTGGACTATTTGCATCGTATTTGTGGTGTAATACACACTGATATTAAA
CCAGAAAACATAGTTGTTTCAAGCTCTTCTATTCCCATGGTTGATTTTAGAGTTATCAAC
ACTGAGGAAAAGTACGATGCTGATTCCTCAAACATCAAAGAATCAGACGACCACCACATT
CGGGATGGTTCTAATTCTGATAATAATATTAAAGACGTTACTACAGCAACTGAAATAACA
AATTCAACAACCACTGACTCTATTCATAATAGTAATATCAATGACTCCAGCAATGGAAAT
TCAACCCAGTATCATGGTTTAAACGCTAAAGAAAGAAGAAGGTTAAAGAGGAAAAATCAG
AGAAAAAACAAGCAAAAGTTAGGTTCTCAGGCTACCGAGTCAAGTAACGCTGAGGAAGAC
GTTATTGAATGTATACCAGGTGAGGCCAAAAATACTGAAGCTTCTGAAAAAGGAAAAAGA
TTATCTACTCCTCCATTTTTAAAGCTTCATCTGAAACCAATGCCTTCAGACCCAACTCAT
TCAAGTTATTATCAAATCAATTTTAATTCAAAGAAGTTGGAAAAAAGCTCTACTGAGCAG
AGCAAATTGGAGAATTTTAGCTTCAATTCTAACATGAATAACTTAAACCAGTTTCCTTTA
ATAAAACCACCTTATCATCATCACTTGTATGAAGTTTACCATCCTCAGCAGTATATTGCA
AATGACGAGCAAAGATATACTCATTTACTTCCTCTCACTCAATGGAACAAAAGTTATGTT
GGAGGTTCAGAACTATTATCACAAGAAAATCATAAGTTTTCAAAGAACTCGAAGGCAGAT
TCGAAAGCTCTTCGTTTTGAGGTTAATGAGGAAAAAATTATGGAGGTTTCCAATGTTATT
AAACAAATTAGTGTAAATCCTAACACGTTTTCTAGAAATGAAGCAGAGTACTTCATTGTA
GATCTTGGGAATGCTTGCTGGATGAATAAGCATTTTAGTCAAGATATTCAGACCAGGCAA
TATAGGAGTCCTGAAGTTATTGTAGGCGCGGGTTACGATTGGTCTGCTGACATTTGGAGC
TTAGGATGCACAATCTTTGAACTTCTAACCGGTGATCTTCTCTTCACTCCCAAGGCAACA
GAAGATTTTAGTGGTGACGATGATCATCTTGCTCAGATGATTGAGCTTCTTGGAGAATTT
CCTAAGTCTTTGATCAAATCAGGAAAACATTCAAAAAGATTTTTTAATAAACATAATAAA
TTACACAAAATCTCCAAGCTTCAATATTGGGATCTGAAGTCAGTTCTTATTCATAAATAC
TGTATCAACAAATTTGAGGCCCACAATTTTTCCTTATTCTTATATTCATTTTTGGCCTTA
GACCCCAGGATGAGGCCTGGAGCTCAAACTTTGCTTGATCATCCATGGCTACGTATCAGA
GGAGTCAGTTCGGATTATTTGGAAAATATGTTAGCTCGTCTTGAGAGACCTCTCACATTA
GTAGATGAAGAGAATATCTCCAGAGATCTACAATCTCTTACGATTTCTGAGAATGATCAA
GGAGATTACAAAAATTTATCCATTAAACAAGAATTATCCGAATGGTTTAATCAATTTAAG
AAGACCCTTGATTCATAA
1. Choose an organism
2. Perform BLASTN and TBLASTX searches
3. Compare the results
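TBLASTX compares the six-frame translation of the query with the six-frame translations of the database sequences, so it can detect similarity at the protein level even when the DNA has diverged. A rough sketch of six-frame translation follows, assuming the standard genetic code; the function names are illustrative, and the short test sequence is taken from the start of the coding region of the unknown gene above.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; codons with unexpected letters become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return translations keyed by frame name: +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, peptide in six_frames("ATGGAAAATTTGGGGAAA").items():
    print(name, peptide)
```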
Primer design
Primer3 web site and instructions
a. Get the gene sequence.
b. Obtain the entire gene sequence with introns.
c. Paste the sequence into the Primer3 web site.
d. Add parameters (as discussed)
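Two quantities that come up for every candidate primer are GC content and melting temperature. The sketch below assumes the simple Wallace rule, Tm = 2(A+T) + 4(G+C), which is only a rough estimate for short oligonucleotides; Primer3 uses a nearest-neighbour thermodynamic model, so its reported Tm will differ. The example primer is hypothetical.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


candidate = "AGCTTGCATGCCTGCAGGTC"  # hypothetical 20-mer, for illustration only
print(f"GC = {gc_percent(candidate):.1f}%, Tm (Wallace) = {wallace_tm(candidate)} C")
```

A forward and reverse primer with similar estimated Tm values are generally easier to use together in one reaction.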
Links
- FASTQ formats
Understanding your database
Go to your database of choice (PlasmoDB, SchistoDB, CryptoDB, ... EuPathDB)
- Go to the data summary link and find which data types are available
- Visit the EuPathDB data type link
- Do you understand all the data types?
Finding a gene using text search
Note: For this exercise use PlasmoDB
a. Find all possible kinases in Plasmodium. (hint: use the keyword “kinase” in the “Text” box).
- Choose a species
- How many genes did you get?
- How many of those are in P. falciparum? How did you find this out?
- What happens if you search using the word “kinases”? How many results were returned?
b. How can you increase the number of possible kinases in your results? (hint: the search you did in “a” will miss things like “6-phosphofructokinase” or “kinases”, so you need to use a wild card in your search: try “kinase*”, “*kinase” and “*kinase*”)
- Did you get more results?
- Which one of the above wild card combinations gave you the largest number of kinases?
c. Find only the kinases that specifically have the word “kinase” in the gene product name. (hint: to do this you need to go to the text search page – there are many ways to get there, how did you get there?)
- How many kinases have the word kinase in their product names? (hint: did you remember to use the wild card?)
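To see why the three wild card patterns return different numbers of genes, the sketch below applies them word by word to a few product names. It uses Python's fnmatch module as an analogy only; how PlasmoDB tokenizes and indexes text is not specified here and is an assumption, and the product names are hypothetical examples rather than search results.

```python
from fnmatch import fnmatchcase

# Hypothetical product descriptions, for illustration only.
PRODUCTS = [
    "protein kinase",
    "6-phosphofructokinase",
    "cyclin-dependent kinases regulatory subunit",
    "hexokinase",
    "heat shock protein 70",
]


def search(pattern, products):
    """Return products in which at least one whitespace-separated word matches."""
    pattern = pattern.lower()
    return [
        p for p in products
        if any(fnmatchcase(word, pattern) for word in p.lower().split())
    ]


for pattern in ("kinase", "kinase*", "*kinase", "*kinase*"):
    print(f"{pattern:10s} -> {len(search(pattern, PRODUCTS))} hit(s)")
```

In this toy list the pattern with a wild card on both sides is the only one that catches every kinase-like name.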
Gene page
a. Find the bifunctional dihydrofolate reductase-thymidylate synthase (DHFR-TS) gene (or, if you prefer, the apical membrane antigen 1 gene; AMA1) in P. falciparum.
- How did you navigate to this gene? What other ways could you get there? (hint: what about using the gene ID, PFD0830w?)
- How many exons does this gene have? How many nucleotides of coding sequence?
b. What genes are located upstream & downstream of DHFR-TS (AMA1) in P. falciparum?
- Is synteny (chromosome organization) in this region maintained in other species?
c. How many Single Nucleotide Polymorphisms (SNPs) can you identify within the P. falciparum DHFR-TS (or AMA1) gene?
- How many of these SNPs are in coding sequence? How many impact the predicted protein sequence? How many alleles are there for each SNP? What is the maximum number of SNPs per strain?
- How do these results compare with SNP distribution in other genes?
d. Find the MSP1 gene of Plasmodium falciparum
- Go to the Genome Browser
- Is the MSP1 gene expressed?
- What kinds of data in PlasmoDB provide evidence for expression? At what life cycle stage is MSP1 most abundant?
- How do the different life cycle microarray expression profiles compare to each other? Are the results similar? What about RNA sequencing data, does it agree with the microarray data?
- How abundant is the MSP1 protein? How confident are you of this analysis?
- Which genes are located upstream or downstream?
- Are there SNPs present in the gene sequence? How many alleles?
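Several of the questions above ask which coding SNPs change the predicted protein. A minimal sketch of that distinction follows, assuming the standard genetic code and a coding sequence supplied in frame. The sequence and SNP positions are toy values, not data from PlasmoDB, and the function name classify_snp is illustrative.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snp(cds, pos, alt):
    """Classify a substitution at 0-based position pos of an in-frame CDS."""
    cds = cds.upper()
    start = pos - pos % 3                      # first base of the affected codon
    ref_codon = cds[start:start + 3]
    within = pos % 3                           # position inside the codon
    alt_codon = ref_codon[:within] + alt.upper() + ref_codon[within + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return "non-synonymous"


toy_cds = "ATGGAAAAA"  # toy sequence encoding M-E-K
for pos, alt in [(5, "G"), (3, "A"), (6, "T")]:
    print(pos, alt, classify_snp(toy_cds, pos, alt))
```

Only the non-synonymous and nonsense classes alter the predicted protein; synonymous SNPs change the DNA but not the amino acid.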

