Homework assignment 4
Part 1) Examining the effect of gap penalties on sequence alignment.
Please choose two proteins of interest to you which are only 30 to 50% identical and which have gaps in their alignment. The easiest way to find two such proteins is to use a protein of interest to you and perform a BLAST search of UniProt/SwissProt (either on UNIPROT or NCBI BLAST) and then choose the second protein with a score between 30% and 50% of the maximal score. Please check that the alignment with the query contains some gaps.
Align your two proteins using the Smith-Waterman algorithm (use either EBI's EMBOSS Water(man) or Expasy's SIM alignment tool). Perform the alignments with gap penalties of 5, 10, 20, 50 and 100. Keep the ratio of the gap penalty to the gap extension penalty a constant.
Examine the alignments and describe the effect of raising the gap penalty on the number and arrangement of the gaps. Mention which alignments have the highest bit score or quality and if the gaps or mismatches disrupt any known functional or structural motifs (Prosite, MyHits, InterPro or Conserved Domains). Which alignments give the most biological results? Show the alignments to support your conclusion.
Part 2) Comparing Smith-Waterman, UNGAPPED and GAPPED BLAST searches.
Find a protein family containing only 50-100 members by examining different motifs in the PROSITE database. If you can't find one quickly, you may use a member of the family of pyruvate dehydrogenase E1 component with enzyme commission number EC 18.104.22.168 for your gold standard (Uniprot advanced search (ec:"Pyruvate dehydrogenase (acetyl-transferring) (22.214.171.124) [126.96.36.199]" AND name:alpha) AND reviewed:yes). Also be sure to REMOVE any sequences described as "fragments" from your gold standard [(ec:"Pyruvate dehydrogenase (acetyl-transferring) (188.8.131.52) [184.108.40.206]" AND name:alpha) AND reviewed:yes NOT fragment]. You may use either subunit alpha (46 examples in UniProt/SwissProt [(ec:"Pyruvate dehydrogenase (acetyl-transferring) (220.127.116.11) [18.104.22.168]" AND name:alpha) AND reviewed:yes NOT fragment] ) or the beta subunit (50 examples). You may choose any member of the family to be your query. Do NOT use one of the bacterial pyruvate dehydrogenases sequences that have only a single subunit 800 amino acids long. You should print out and keep your "gold standard " family list from UniProt.
Choose one sequence of the family as a query and use the EBI BLAST page and SSEARCH page to perform UnGAPPED BLAST (set the parameter "gap align" to "false" in part EBI BLAST), SW BLAST (make sure the "gap align" parameter is set to "yes") and the standard Smith-Waterman algorithm searches of the UniProt/SwissProt database. Make sure that in each case you collect at least 100 sequences in your result set (or twice the expected number of family members). Also be sure to turn query filtering OFF (the default is ON).
Now compare the three searches using the Receiver-Operating Characteristic (ROC) curves. For each search, draw a line across each output list after every 10 results and count the total number of true and false positive results above each line (cumulative). Remember that the gold standard determines whether a sequence is a true or false positive. Continue until you have collected at least 50 false positive sequences (ROC-50 curve). Finally, plot the number of true positives versus the number of false positives on a two dimensional graph for each search. it can be helpful to plot all three curves on the same graph for your interpretation.
Do the three searches have identical shaped curves?
Is one curve higher than the other? If so, which search is the best for your protein family?
Please send the actual ROC curves (either a graphic file or an EXCEL spreadsheet with the graphs in place) to support your conclusions.
Send the results to firstname.lastname@example.org.