Project


Besides scribing, the project is the one assignment for the course. You may work with one or two partners if you wish (see below for additional requirements). Choose from the topics listed or discuss with us a modification or original idea of your own. Take a look at the links to papers and sites in the References page as there are several there relevant to the given topics.

Projects are to be presented in the final class of the quarter. A 3-5 page report is due in Finals week.


Schedule:

Feel free to discuss your progress and any questions at any time. Formally, there are two checkpoints, with exact dates to be determined based on the size of the class.
Week 3 (April 17-19): Checkpoint 1. Discuss your project choice with the TA and submit proposed milestones with dates.
Week 7 (May 15): Checkpoint 2. Submit 1/2 page report on milestones achieved. Discuss remaining work.
Week 11 (June 12): Written paper due, present in class (time limit to be decided based on number of projects).


Requirements:

Generally, projects should include all of the below. The relative weightings between various aspects may of course vary with topic.


Topics:

Structure prediction
Clustering and distance metrics
Protein design
Something else

  1. Structure prediction. The prediction of the tertiary structure of a protein from its sequence is an active, and crowded, area of research. As discussed in the reading, there are various algorithms in existence, none of which has solved the problem. The best performing approaches today are those that combine several programs.

    It is easy to get a prediction right if one knows the answer, and it is hard to compare the quality of prediction algorithms when used on different targets. In response to these concerns, the prediction groups have embraced public prediction servers that implement their algorithms, and also embraced periodic tests of algorithms with unpublished protein structures. CASP and CAFASP are the well known competitions. The Live Bench assessment is a continuous test of servers.

    In this project, examine the top scoring non-consensus algorithms tested in Live Bench and/or CASP (choose at least 3). Because they are more developed, we suggest you focus on homology modeling and/or secondary structure algorithms. Also, because of its popularity and success in CASP, we suggest the Rosetta algorithm as one to consider. Where does each algorithm have its strengths and weaknesses? What does each require?

    You'll notice on Live Bench that the so-called "consensus servers," which combine the results of other servers, do best. Looking at these results together with your previous analysis, propose your own consensus scheme and justify it. How does it perform? A simple implementation is to write a script that automates uploading to and downloading from the servers you wish to combine, and you can download CASP or previous Live Bench targets as test sequences. Beware of drawing broad generalizations from the relatively small test sets you'll be using.
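
As one illustration of what a consensus scheme could look like, here is a minimal sketch that combines per-residue secondary structure predictions by weighted majority vote. The server names, the three-state H/E/C alphabet, the example strings, and the function names are all assumptions made for this example; it is not how any particular consensus server works, and fetching results from real servers is left out entirely.

```python
from collections import Counter


def consensus_prediction(predictions, weights=None):
    """Combine per-residue state strings by (weighted) majority vote.

    predictions: dict mapping a (hypothetical) server name to a string over
                 H/E/C, one character per residue, all of equal length.
    weights:     optional dict of per-server weights; missing servers get 1.0.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    lengths = {len(p) for p in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all predictions must cover the same residues")
    weights = weights or {}
    consensus = []
    for i in range(lengths.pop()):
        votes = Counter()
        for server, pred in predictions.items():
            votes[pred[i]] += weights.get(server, 1.0)
        # Ties are broken deterministically in favor of the alphabetically
        # first state, because max() keeps the first maximal element.
        consensus.append(max(sorted(votes), key=lambda state: votes[state]))
    return "".join(consensus)


def q3(predicted, observed):
    """Fraction of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("strings must have the same length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


# Made-up predictions for a ten-residue target (illustration only).
preds = {
    "server_a": "HHHHCCEEEC",
    "server_b": "HHHCCCEEEC",
    "server_c": "HHHHCEEEEC",
}
observed = "HHHHCCEEEC"

combined = consensus_prediction(preds)
favor_b = consensus_prediction(preds, weights={"server_b": 3.0})
print(combined, q3(combined, observed))
print(favor_b, q3(favor_b, observed))
```

The weights are where your own analysis would enter, for example by setting them from each server's past performance on similar targets.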

    If there are two of you, consider more formal methods of determining when to use a given algorithm or for combining algorithms. For either of these, a clustering or learning algorithm may be applied, for example, with the feature space up to you.

    If there are three of you, then extend the above. One way is by a quantification of confidence in a given server's prediction for a given target (based on its performance for other, similar targets). Another option is to modify the package Modeller or another code with your group's ideas for combining/improving algorithms.


  2. Clustering and distance metrics. The clustering of conformations (of a single protein) is vital to the study of protein dynamics. As seen, its use in roadmaps and Markov models has several applications. The clustering of different proteins is also important, for example to the study of evolution and in the similarity searches that are used for a variety of applications.

    In this project, choose one of the two scenarios (different conformations of the same protein or different proteins) and examine best practices for clustering. Then, focus on choosing a distance metric that you find interesting and 1-2 clustering algorithms that are computationally efficient for it. Standard clustering algorithms to remember are K-means, hierarchical clustering, and simple grid. Examples of distance metrics are dRMSD and secondary structure fractional commonality (we can point you to software that computes these if you don't want to write them yourself). How does changing the metric change your output? How does changing the parameters of the clustering algorithm change the output? If you are clustering conformations, ask the TA for a data set of molecular dynamics trajectories you may use. If you are classifying proteins, use this data set. For the latter, consider what CATH and SCOP do and see how existing online classifiers do. You can incorporate their results in the clustering if you wish.
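
To make the metric and algorithm choices concrete, here is a minimal sketch of dRMSD (the root-mean-square difference between the two conformations' internal pairwise distances) together with a naive average-linkage hierarchical clustering that stops at a distance cutoff. The three-atom toy conformations, the cutoff value, and the function names are assumptions for this example; a real data set would need a faster implementation than this cubic-time loop.

```python
import math
from itertools import combinations


def internal_distances(coords):
    """All pairwise distances between the atoms of one conformation."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def drmsd(conf_a, conf_b):
    """Root-mean-square difference of corresponding internal distances."""
    da, db = internal_distances(conf_a), internal_distances(conf_b)
    if len(da) != len(db) or not da:
        raise ValueError("conformations need the same number (>= 2) of atoms")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)) / len(da))


def agglomerative(items, metric, cutoff):
    """Average-linkage clustering; returns lists of indices into items.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than cutoff.
    """
    n = len(items)
    dist = [[metric(items[i], items[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[a], clusters[b]) > cutoff:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b, so index a is unaffected
    return clusters


# Toy three-atom "conformations": two nearly extended, two bent.
conformations = [
    [(0, 0, 0), (1, 0, 0), (2, 0, 0)],
    [(0, 0, 0), (1, 0, 0), (2.1, 0, 0)],
    [(0, 0, 0), (1, 0, 0), (1, 1, 0)],
    [(0, 0, 0), (1, 0, 0), (1, 1.1, 0)],
]
clusters = agglomerative(conformations, drmsd, cutoff=0.2)
print(clusters)
```

Because dRMSD depends only on internal distances, it needs no superposition step, but it also cannot tell a conformation from its mirror image. Swapping in a different `metric` function or changing `cutoff` is a direct way to explore the questions above.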

    If there are two of you, and if you are clustering conformations, then consider how you can use your clustering to present the ensemble of trajectories. In other words, use the clustering to assess commonalities and differences between different trajectories. If you are classifying proteins, then consider how you can combine various classifiers to come up with a more interesting or accurate (for a given classification type) classification.

    If there are three of you, then do the clustering of both conformations and proteins discussed for a single-person project (the code will be mostly common) and then choose one of the two tasks outlined for a two-person group.


  3. Protein design. This is a topic we did not discuss in class so it is here as an option to pursue if you wish. Protein design can be considered the opposite of structure prediction -- we wish to determine what low energy sequence will yield a given structure (actually, we may also want to design for given functions, but leave that aside for now). One would think that if we understand protein structure, then we should be able to design proteins. Unfortunately, for one thing, our understanding is certainly not complete, and for another, even if it were, the sequence space is huge (20^n where n is the sequence length). The latter point makes protein design an interesting algorithmic area. We have put some basic references here.
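
To get a feel for the 20^n growth, the short sketch below evaluates the count for a few chain lengths. The function name and the chosen lengths are just for illustration.

```python
def sequence_space_size(n, alphabet_size=20):
    """Number of distinct sequences of length n over the given alphabet."""
    return alphabet_size ** n


for n in (10, 50, 100):
    # Python integers are exact; convert to float only for compact printing.
    print(f"n = {n:3d}: about {float(sequence_space_size(n)):.2e} sequences")
```

Even a 10-residue peptide already has more than 10^13 possible sequences, which is why exhaustive enumeration is only feasible for toy problems.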

    First, familiarize yourself with the major algorithms in use: dead-end elimination, genetic algorithms, and Monte Carlo. What are the differences between them and what assumptions do they make? (In particular, pay attention to what type of stipulations on the form of the energy function each has!) What approximations do they make? (Pay attention to flexibility!)

    Due to the computational burdens, there are not as many protein design servers as there are prediction servers, but experiment with Rosetta Design. How does it do on a variety of structures? Implement the mentioned design algorithms plus one or two variations (published or original) yourself for a toy problem. In particular, you need not write code to read in protein structures or have proper parameterization of atoms. Indeed, you can think of trying to design a chain of jelly beans instead of amino acids -- the point is to compare the algorithms, not to actually design a true protein.
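
As a starting point for such a toy problem, here is a minimal sketch of Metropolis Monte Carlo sequence design on a "jelly bean" chain, with exhaustive enumeration as a reference for small chains. Everything specific in it is an assumption made for illustration: the three-flavor alphabet, the pair energy table, the contact list standing in for a fixed structure, and the temperature and step count. Dead-end elimination and genetic algorithms are not shown.

```python
import itertools
import math
import random

FLAVORS = "RGB"  # hypothetical three-letter jelly bean alphabet

# Hypothetical pair energies, keyed by the sorted pair of flavors.
PAIR_ENERGY = {
    ("R", "R"): -1.0, ("G", "G"): -0.5, ("B", "B"): 0.0,
    ("G", "R"): 0.5, ("B", "R"): -1.5, ("B", "G"): 0.2,
}


def energy(seq, contacts):
    """Sum of pair energies over the positions that are in contact."""
    return sum(PAIR_ENERGY[tuple(sorted((seq[i], seq[j])))] for i, j in contacts)


def monte_carlo_design(n, contacts, steps=5000, temperature=0.5, seed=0):
    """Metropolis search over sequences; returns (best_sequence, best_energy)."""
    rng = random.Random(seed)
    seq = [rng.choice(FLAVORS) for _ in range(n)]
    current = energy(seq, contacts)
    best_seq, best = list(seq), current
    for _ in range(steps):
        pos = rng.randrange(n)
        old = seq[pos]
        seq[pos] = rng.choice(FLAVORS)  # propose a single-position mutation
        trial = energy(seq, contacts)   # full recomputation, for simplicity
        delta = trial - current
        # Always accept downhill moves; accept uphill with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = trial
            if current < best:
                best_seq, best = list(seq), current
        else:
            seq[pos] = old  # reject: undo the mutation
    return "".join(best_seq), best


def exhaustive_design(n, contacts):
    """Enumerate every sequence; only feasible for very short chains."""
    best_seq = min(itertools.product(FLAVORS, repeat=n),
                   key=lambda s: energy(s, contacts))
    return "".join(best_seq), energy(best_seq, contacts)


# Toy "structure": an 8-bead chain with neighbor contacts plus two long-range ones.
n_beads = 8
contacts = [(i, i + 1) for i in range(n_beads - 1)] + [(0, 7), (2, 5)]

mc_seq, mc_energy = monte_carlo_design(n_beads, contacts)
ex_seq, ex_energy = exhaustive_design(n_beads, contacts)
print("Monte Carlo:", mc_seq, mc_energy)
print("Exhaustive: ", ex_seq, ex_energy)
```

Comparing the Monte Carlo result against the exhaustive optimum across seeds, temperatures, and chain lengths is one simple way to quantify how an approximate algorithm behaves before moving to problems too large to enumerate.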

    For two or three people, consider the same questions as above but actually have your code take PDB structure inputs and make an effort to have plausible parameters. This may be simplest to do by building off an existing simulation package that already has those aspects rather than from scratch (we can provide more specific tips on this), but do not attempt that unless you feel comfortable with software engineering on an existing code base. You need implement only one algorithm since energy models may constrain choice. We don't require you to successfully design a protein, but do assess the program's performance.


  4. Something else. You may have ideas of your own. If so, discuss them with us. One area not covered in depth above, for example, is dynamics, because it is difficult to come up with a dynamics project that fits within the scope of a single quarter. If you have an idea in this or another area, though, feel free to propose it.