Protein Modeling

Protein Modeling was introduced as a trial in 2009 and as a National Event in 2010/2011.

Description
Students will explore protein structure and function by building 3D models of selected proteins based on computer models- one prior to competition, and one on site at the competition- and taking a test on both general protein-related concepts and information specific to the protein chosen for that year. For 2011, students will model proteins involved in reprogramming adult cells to become stem cells.

Pre-Build Model
For the pre-build model, a section of a certain protein will be specified from the Protein Data Bank. Students will be provided beforehand with a Mini-Toober of the correct length, at a scale of 2 cm per amino acid residue, and red and blue plastic end-caps representing the carboxy and amino termini of the protein chain respectively. The model must be correctly folded, and the end-caps placed at the correct ends of the protein chain, but the model must also include "creative additions" that showcase the function of the selected protein. Students must decide for themselves which sidechains and/or ligands are important to the protein's function and should be displayed. These additions must be explained on a two-sided, 3x5 index card, along with a paragraph explanation of how they are relevant to the protein's function, and submitted with the model, typically at impound.
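Since the scale is fixed at 2 cm per residue, you can work out how long your Mini-Toober should be, and how far along it a given residue falls, before you start folding. Here is a minimal Python sketch of that arithmetic. The function names and the residue range 41-60 are made up for illustration and do not come from the rules.

```python
CM_PER_RESIDUE = 2  # scale stated in the event description

def toober_length_cm(first_residue, last_residue):
    """Length of Mini-Toober needed for residues first..last, inclusive."""
    if last_residue < first_residue:
        raise ValueError("last_residue must not be smaller than first_residue")
    residue_count = last_residue - first_residue + 1  # inclusive range
    return residue_count * CM_PER_RESIDUE

def mark_position_cm(residue, first_residue):
    """Distance from the amino-terminal end to the start of a residue."""
    return (residue - first_residue) * CM_PER_RESIDUE

# Hypothetical section: residues 41 through 60 (20 residues)
print(toober_length_cm(41, 60))   # 40
print(mark_position_cm(45, 41))   # 8
```

Measuring from the amino-terminal end is just one convention; what matters is that you pick one end and stick with it while marking.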

On-Site Model
For the on-site model, a file will be specified from the Protein Data Bank, but the specific section of this protein that students are to build will not be stated until the competition. At the competition, students will be given a Mini-Toober of the correct length, at the same scale of 2 cm per amino acid residue, foam representations of important sidechains or associated ligands, and red and blue plastic end-caps representing the carboxy and amino termini of the protein chain respectively. The model must be correctly folded, the end-caps placed at the correct ends of the protein chain, and the given sidechains or ligands attached to the Mini-Toober at the correct locations and with the correct orientations.

Written Test
The written test consists of multiple choice and short answer questions about general protein topics (e.g., "What is the full name of the N-terminus of a protein chain?"), information specific to the selected protein (e.g., for the 2009-10 hemagglutinin, "Why are pigs important to the spread of influenza viruses?"), and information extremely specific to the PDB file (e.g., "What is the resolution in Angstroms of the PDB file 1HTM.pdb?").

These attributes are consistent across all levels of competition, because the Center for Biomolecular Modeling at MSOE provides all tests for the Protein Modeling event. However, the questions (at least ostensibly) get more challenging at each successive level of competition.

Scoring
40% of the score is determined by the pre-build model, 30% by the on-site model, and 30% by the written test.
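To see how the three parts combine, here is a small Python sketch of the weighting. It assumes each part is first converted to a fraction of its possible points (earned divided by possible); how event supervisors actually normalize rubric points is not stated here, and the example numbers are invented.

```python
# Weights from the scoring breakdown above
WEIGHTS = {"pre_build": 0.40, "on_site": 0.30, "written_test": 0.30}

def overall_score(fractions):
    """fractions maps each part to earned/possible (0.0 to 1.0); returns a percent."""
    return 100 * sum(WEIGHTS[part] * fractions[part] for part in WEIGHTS)

# Invented example: 90% on the pre-build, 70% on the on-site, 24 of 30 test points
example = {"pre_build": 0.90, "on_site": 0.70, "written_test": 24 / 30}
print(round(overall_score(example), 1))  # 81.0
```

Because the pre-build carries the largest weight, a few extra points there move the total more than the same fraction gained on either of the other two parts.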

Pre-build Model
The pre-build model is scored based on:
 * Accuracy and correct placement of secondary structures (alpha helices and beta sheets)
 * Accuracy of tertiary structure (3D arrangement of secondary structures- e.g., two strands of a beta sheet being correctly placed next to each other, and a nearby helix being perpendicular to both)
 * Relevance and correct placement of sidechains or ligands you've chosen to display (and clarity/accuracy of the explanation on your card), as well as correct placement of end-caps
 * Relevance and creativity of "creative additions" (e.g., labeling the "fusion peptide" on hemagglutinin would get you points, but displaying every sidechain wouldn't, since that doesn't show any understanding of the protein's function)

For an example of how specific it has to be (and of what constitutes good "creative additions"), see the [[Media:2010_SONT_Prebuild.pdf|2010 National Pre-build Rubric]].

On-Site Model
The on-site model is scored based on:
 * Accuracy and correct placement of secondary structures (alpha helices and beta sheets)
 * Accuracy of tertiary structure (3D arrangement of secondary structures- e.g., two strands of a beta sheet being correctly placed next to each other, and a nearby helix being perpendicular to both)
 * Correct placement and orientation of given sidechains and ligands, as well as correct placement of end-caps

For an example of what exactly they look for in the on-site model, see the [[Media:2010_SONT_Onsite.pdf|2010 National On-site Rubric]].

Written Test
There are relatively few questions on the test- for example, the 2010 National test consisted of 10 multiple choice and 5 short answer questions. Each multiple choice question was worth one point, and the short answer questions were worth 20 points total (each question was assigned its own point value, which was printed on the test). A similar number of questions and mode of scoring can be expected on future tests. Some of the short answer questions will be selected as tiebreaker questions, and will be designated as such. These still count normally toward your regular score on the test portion, but will be weighted more in the event of a tie.

Materials
This year, five two-sided 8.5x11 sheets of reference materials from any source, typed, handwritten, or graphics, are allowed, along with a scientific calculator and writing utensils (a Sharpie or other permanent marker is recommended for marking the Mini-Toober for the on-site model, as well as normal pens/pencils for taking the test).

Content
While this information may appear on the written test, it's also important to know for the construction of your models- particularly the pre-build model, on which you have to decide what's vital to show or not.

Protein Structure
"Structure equals function" is the basic tenet of Protein Modeling: i.e., it's important to know what a protein's structure is like because its function is determined by its structure.

There are four different types of protein structure: primary, secondary, tertiary, and quaternary.

Primary Structure
Primary structure is the sequence of amino acid residues in a protein chain (they're called residues, by the way, because they're not complete amino acids anymore: each one lost a hydrogen from its amino group and a hydroxyl group from its carboxylic acid group when it bonded to its neighbors through dehydration synthesis; TL;DR the ends of the amino acids are missing because they're connected, so we call them "residues" instead of "amino acids"). There are 20 main varieties of amino acid, which differ only in their sidechain (sometimes called an "R group").

Different residue sidechains have different properties; for example, the red sidechains in the diagram are negatively charged, and the blue ones are positively charged. These properties determine how the protein folds (i.e., the secondary and tertiary structure), because certain types of residues attract, repel or bond to other types of residues. Also, the types of residues present can determine how the protein interacts with other molecules such as DNA- for example, serine can form hydrogen bonds, and therefore is often found at binding sites in a protein.

Charged sidechains repel like charges and attract opposite charges. Hydrophilic, or polar, sidechains usually end up on the outside of a folded structure, because most proteins fold in a watery environment and the polar sidechains interact well with water, which is also polar. For the same reason, hydrophobic, or non-polar, sidechains usually end up on the inside of the structure, because they do not interact well with water. Cysteine, which is shown in green in the diagram, forms very strong covalent disulfide bonds with other cysteines.

Each residue in a chain is given a number, starting at the amino terminus (that is, the end that has an amino group still present) with 1 and going up to the carboxy terminus (the end that has a carboxyl group still present).
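If it helps to see the numbering and the sidechain properties side by side, here is a rough Python sketch. The six-residue sequence is invented, the helper names are hypothetical, and only a handful of well-known residues are tagged (lysine and arginine as positive, aspartate and glutamate as negative, cysteine for disulfide bonds); everything else is lumped into "other".

```python
POSITIVE = {"LYS", "ARG"}   # positively charged sidechains
NEGATIVE = {"ASP", "GLU"}   # negatively charged sidechains

def tag_sidechain(name):
    """Give a short label for the few sidechain types discussed above."""
    if name in POSITIVE:
        return "positive"
    if name in NEGATIVE:
        return "negative"
    if name == "CYS":
        return "can form disulfide bonds"
    return "other"

def number_residues(sequence):
    """Number residues starting at 1 from the amino terminus."""
    return [(number, name, tag_sidechain(name))
            for number, name in enumerate(sequence, start=1)]

# Invented sequence, listed from the amino terminus to the carboxy terminus
chain = ["MET", "LYS", "CYS", "ASP", "GLY", "CYS"]
for row in number_residues(chain):
    print(row)

cysteines = [number for number, name, _ in number_residues(chain) if name == "CYS"]
print(cysteines)  # [3, 6]
```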

Secondary Structure
Secondary structure is the first level of folding in a protein. Patterns called "motifs", such as alpha helices and beta sheets (by far the two most common), are caused by hydrogen bonding between backbone atoms of the residues: the hydrogen on one residue's backbone nitrogen (N-H) bonds with the oxygen of another residue's backbone carbonyl (C=O). Sidechains are not involved in these bonds. Alpha helices are slightly more common in proteins overall than beta sheets. These helices are tightly coiled single strands, kept in place by hydrogen bonds between nearby residues (each residue bonds with the one four positions further along the chain). They can be anywhere from only a few residues to over 100 Angstroms in length in some proteins. They tend to be the base of protein "stalks" (such as that of 2009-10's influenza hemagglutinin).

Beta sheets, on the other hand, are made up of multiple beta strands- kinked sequences of residues separated by loops. These strands line up side by side, either parallel or antiparallel; antiparallel means that adjacent strands point in opposite directions (direction matters, remember, because of the numbering of residues from the amino terminus to the carboxy terminus). Multiple hydrogen bonds connect adjacent strands. Beta sheets are very strong as protective or support layers (such as the "beta-barrel" exterior of GFP).

Tertiary Structure
Tertiary structure is the position in three dimensions of the secondary structures (motifs). It is determined by the secondary structures present, as well as the properties of the sidechains. Hydrophilic sidechains such as glutamine will move to the "outside" when the protein is folded in a watery environment, while hydrophobic sidechains such as tryptophan will cluster "inside" the protein, protected by other sections of the protein, to prevent their exposure to water. Oppositely charged sidechains come together, forming salt bridges (ionic bonds), while sidechains with the same charge repel each other. Cysteine, which contains sulfur, bonds covalently with other cysteines to form strong disulfide bonds. The interaction of all these attractions and repulsions causes the protein to develop a unique shape in 3D, called a "conformation".

The protein's tertiary structure also depends on the environment in which it is folded: in the human body, which is a watery environment, the hydrophobic (nonpolar) sidechains end up on the inside, as stated above. However, if a protein is folded in a nonpolar solvent (such as, say, vegetable oil), the hydrophilic (polar) sidechains end up on the inside.

Quaternary Structure
Quaternary structure is the arrangement of each of the individual pieces (monomers) of a multi-unit (multimeric) protein. These subunits, or "chains" as they are often called, each have their own amino and carboxy terminus, and are not joined to each other by peptide bonds. However, they are held together by other bonds- which can be covalent disulfide bonds or, more commonly, non-covalent interactions such as ionic bonds- and arranged together in a specific conformation. Multimers are quite common, and may contain several distinct chains or simply several copies of the same one (or few).

2011-Specific Info
The 2011 pre-build protein is Klf4, a protein used to turn adult somatic cells (i.e., normal body cells) into induced pluripotent stem cells (iPSCs), and found under the Protein Data Bank ID 2WBU.

At Regional competitions, the on-site model will be a selected region of Oct4 (PDB ID 1GTO). At States, it will be a section of Nanog (PDB ID 2KT0). At Nationals, it will be a part of c-Myc (PDB ID 1NKP). All four of these have been used in the process of creating iPSCs. If you go to an Invitational tournament, according to the CBM site, the on-site model will be Sox2 (possibly under PDB ID 1O4X, although neither the rules nor the CBM site specify a file).

The written test will, as usual, be about the relationship between protein structure and function, but will have an emphasis on iPSCs.

Klf4
Klf4 is a transcription factor characterized by its three Cys2His2 zinc fingers. It can promote or repress the expression of other protein-coding genes, notably that of Nanog, another factor involved in pluripotency. It can also up-regulate its own expression, forming a positive feedback loop. It binds to the DNA sequence CACCC.

Oct4
IN PROGRESS

Nanog
Nanog is a transcription factor essential to maintaining pluripotency in stem cells. Its expression is promoted by Klf4 and Oct4.

IN PROGRESS

c-Myc
Because c-Myc is oncogenic (promotes the development of cancer), researchers have been working to create iPSCs without the use of c-Myc. It is possible to create iPSCs without c-Myc, although the process is much less efficient (in human fibroblasts, one type of connective tissue cell, removing c-Myc decreased iPSC production by 90%; in epithelial, or membrane, cells, removing c-Myc decreased iPSC production by 40%). However, the risk of tumor formation outweighs this inefficiency when creating iPSCs for therapeutic use (which is not yet available, except experimentally).

IN PROGRESS

iPSCs
Induced Pluripotent Stem Cells are ordinary body cells that have been turned back into undifferentiated stem cells with the ability to become any kind of body tissue. Pluripotent stem cells, including iPSCs, are characterized by their ability to become any type of cell from any of the three germ layers (although they can't become placental/extraembryonic tissue; only totipotent stem cells can do that), and to reproduce more or less indefinitely while maintaining their undifferentiated state. Scientists such as Shinya Yamanaka have done this by introducing the genes for different transcription factors involved in maintaining pluripotency in embryonic stem cells into somatic cells (originally mouse cells, but it has now been done with human cells as well) through retroviral transfection. Yamanaka used four factors: Oct4, Sox2, Klf4, and c-Myc, all of which appear in the event this year. Other factors that have been used include Nanog, Lin-28 and Esrrb.

Retroviral transfection (or transduction) is a method of introducing genes into cells by taking advantage of viruses' ability to insert genes into a host cell's chromosomes. Viruses do this in order to make the host cell produce the proteins they need in order to reproduce, but if they are instead loaded with the RNA that codes for a desired factor, the host cell will begin to produce that instead. The problem with this method is that it is nearly impossible to control where the virus will insert the gene it carries, so it can disturb necessary existing genes in the cell by inserting in the middle of them. Also, the virus often inserts multiple copies of the same gene, which– particularly in the case of transcription factors that promote cell reproduction and inhibit differentiation, like those involved in the creation of iPSCs– can carry a higher risk of tumor formation. Scientists are currently working on other methods of inserting the genes to lower these risks.

The human cells initially used by Yamanaka and others were fibroblasts, a type of connective tissue cell that makes collagen. These don't divide a lot, so it's difficult to make them into iPSCs; more recently, epithelial cells have been used. Epithelial cells are found covering various tissue surfaces, and divide much more frequently than fibroblasts. They have therefore been more efficient at turning into iPSCs. In particular, c-Myc (which is oncogenic) is less important in epithelial cell iPSCs, which decreases the likelihood of tumor formation.

IN PROGRESS

Using Jmol
Jmol is the software used to create computer models of the proteins for the Protein Modeling event.

Downloading Jmol
Jmol is open-source and can be downloaded here or used online. This is the 2011 pre-build online Jmol environment; it offers the same features as the online Jmol environment you will be using in competition to model the on-site protein, so it's good to use it at least once before competition to become familiar with the interface.

However, it's definitely still worthwhile to download Jmol onto your computer. One of the main reasons for this is that the pre-build online environment only lets you work with that one file, so if you want to familiarize yourself with the on-site protein files before competition, you'll need the Jmol application on your computer.

Downloading Jmol can be a little tricky, though. The download page linked to above gives specific instructions, but to summarize:

Click the link at the top of the page, which will take you to a list of files to download. The file you want is the one called "Jmol-12.0.14-binary.zip", which should have a bunch of different icons next to it (Apple, Windows, Linux, etc).

Your computer will then unarchive the file (it should do this automatically; if it doesn't, you're missing the unarchiving software and need to ask for help from someone who knows more about computers), giving you a folder full of a ton of different files, pretty much all with "Jmol" in the name. The only one you need is Jmol.jar.

Put this one in its own new folder, as close to your home directory as possible, and name the folder something that doesn't have any spaces in it (Jmol can be stupid about saving and retrieving files; spaces in the file path sometimes confuse it, so not only should you not put spaces in the folder name, you should put that folder in your home directory or in a folder in your home directory that has no spaces in the name, to make sure there are no spaces higher in the file path). You can delete everything else from the .zip file.

Save every PDB file you download and every script you create to this folder. Again, Jmol can be stupid; it sometimes has trouble opening files that aren't saved to the same folder that it's in.
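If you want to double-check your setup, the two rules above (no spaces anywhere in the path, and everything in the same folder as Jmol.jar) are easy to test. This is only a sketch in Python; the helper names and the example paths are hypothetical, so swap in wherever you actually saved things.

```python
from pathlib import PurePath

def parts_with_spaces(path):
    """Return every folder or file name in the path that contains a space."""
    return [part for part in PurePath(path).parts if " " in part]

def in_same_folder(path_a, path_b):
    """True if both files sit directly in the same folder."""
    return PurePath(path_a).parent == PurePath(path_b).parent

jmol = "/home/sam/Jmol/Jmol.jar"          # hypothetical location
pdb = "/home/sam/My Downloads/2WBU.pdb"   # hypothetical location

print(parts_with_spaces(jmol))     # []
print(parts_with_spaces(pdb))      # ['My Downloads']
print(in_same_folder(jmol, pdb))   # False
```

In this made-up example the PDB file fails both checks, so it should be moved next to Jmol.jar.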

Downloading PDB Files
If you are using Jmol on your own computer, rather than using the online pre-build environment, you will need the PDB file(s) on your computer as well (note: technically you only need the file for the pre-build; however, to do as well as possible in competition, it helps to be familiar with the on-site beforehand as well. This is one of the main reasons it's worthwhile to download Jmol, so you can look at any protein file you want).

To download a PDB file, go to its structure summary page (linked to in the 2011-Specific Info section) and click the "Download Files" option in the upper-right corner. A drop-down menu will appear. You want the PDB File (Text).



Remember to save this file in the same folder as Jmol.jar!
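A PDB file is plain text, so you can also peek inside it without Jmol. The Python sketch below pulls out the resolution (from the "REMARK 2" line) and counts residues per chain by counting alpha-carbon ("CA") atoms. The sample text is invented and far shorter than a real file, and the function names are hypothetical. It also assumes a single-model file: a file with several models or alternate atom positions would be over-counted.

```python
import re

# Invented sample in PDB text format; a real file has many more records.
SAMPLE_PDB = """\
REMARK   2 RESOLUTION.    2.50 ANGSTROMS.
ATOM      1  N   MET A   1      10.000  10.000  10.000  1.00 20.00           N
ATOM      2  CA  MET A   1      11.000  10.500  10.200  1.00 20.00           C
ATOM      3  CA  CYS A   2      12.500  11.000  10.900  1.00 20.00           C
ATOM      4  CA  HIS B   1      30.000  31.000  32.000  1.00 20.00           C
HETATM    5  O   HOH A 101      15.000  15.000  15.000  1.00 30.00           O
"""

def resolution_angstroms(pdb_text):
    """Return the resolution as a float, or None if no numeric value is given."""
    match = re.search(r"^REMARK\s+2\s+RESOLUTION\.\s+([0-9.]+)\s+ANGSTROMS",
                      pdb_text, re.MULTILINE)
    return float(match.group(1)) if match else None

def residues_per_chain(pdb_text):
    """Count alpha carbons in ATOM records, grouped by chain letter."""
    counts = {}
    for line in pdb_text.splitlines():
        # Fixed columns: atom name is columns 13-16, chain letter is column 22
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            counts[chain] = counts.get(chain, 0) + 1
    return counts

print(resolution_angstroms(SAMPLE_PDB))  # 2.5
print(residues_per_chain(SAMPLE_PDB))    # {'A': 2, 'B': 1}
```

Structures solved by NMR rather than crystallography do not list a numeric resolution, in which case the first function returns None.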

Entering Commands
There are two ways to enter commands in Jmol- through the command prompt or by selecting options from the drop-down menus. Although people who have never done any programming or used a command line setup before might be more comfortable with the menus, it is vastly easier to use the command prompt, and not that difficult once you get used to it. If you're using the online Jmol environment instead of downloading it yourself, you must use the command prompt (most of the drop-down menus aren't available), and since everyone has to use the online environment in competition, it is necessary to know how to use the command line. If you've downloaded Jmol onto your computer, to open the command prompt, go to the File menu and click "Output Console". Some useful commands are listed in the section below.

Quick Guide to Jmol Commands
These are the commands you're most likely to use frequently while modeling your protein in Jmol:


 * select (object): Selects the specified objects, making them the subject of future commands (e.g., "select *a" would select chain A of your protein, since * is the symbol for chain).


 * restrict (object): Selects the specified objects and makes everything else disappear (e.g., "restrict *a" would leave you with a screen showing only chain A, which would also be selected).


 * center (object): Centers the field of view on the specified objects. This is an incredibly useful command after a "restrict", because otherwise, every time you try to look at the other side of your protein fragment, it will swing out of your field of view.


 * undo: Undoes the last command.


 * redo: Redoes the last undone command.

Objects you can select include:


 * chains: As mentioned above, the symbol for a chain is *. It should work as *[letter] or * [letter] (i.e., with or without the space). Jmol's own documentation usually writes a chain as a colon plus the letter, as in "select :a", so try that if the * form gives you trouble.


 * specific residues: You can specify residues using their residue numbers- individual or a range- or three-letter codes. For example, "select 41" will select residue 41, "select 41-60" will select all residues from 41 to 60 inclusive, and "select his" will select ALL histidines. If you want to select all histidines between residues 41 and 60 only, you can string together ranges and residue types using Boolean operators (see below).


 * DNA: DNA works the same way as proteins do in Jmol- in that you can select specific parts, like the backbone or nucleotides- but can also be selected as a whole with "select dna".


 * non-protein stuff: Everything that isn't protein (or DNA) is "hetero", as in "select hetero". This includes any ligands (such as ions or small molecules) associated with the protein, as well as water (which has its own designation as well- "hoh"). Because of the method by which protein structure data is obtained, many of the surrounding water molecules are also captured. They can be quite distracting, so you may want to remove the water or just not show it at all ("select hetero and not hoh"- see below for more information about Boolean operators).


 * sidechains: When you view a given residue, you probably don't want to see the entire residue, including the amino and carboxy groups (or what's left of them); in most cases, you only want to see the sidechain sticking out from the alpha-carbon backbone. You can "select sidechain" (which will select all sidechains of all residues) or "select [residue] and sidechain" (see below for more information about Boolean operators) to select just the sidechain of a particular residue.


 * alpha helices or beta sheets: When folding your Mini-Toober model, it can be helpful to see where the helices and sheets are. You can select the alpha helices with "select helix" and the beta sheets with "select sheet".

Once you have something selected, you can modify the way it is displayed in various ways.


 * backbone [number]: Displays the selection in backbone mode- where only the alpha carbons of the amino acid residues are shown- with the number being the radius. This is the display mode you'll start with- the Mini-Toober represents this part of the protein. A radius of about 300 works well.


 * wireframe [number]: Displays the selection in wireframe mode- where the connections between atoms (not just alpha carbons) are shown as lines/cylinders- with the number being the radius of the cylinders. This display mode is the most common way of viewing sidechains; a radius of about 200 works well. To remove wireframe from the selection, use "wireframe off".


 * spacefill [number]: Displays all atoms in the selection as spheres, the number being the radius of the spheres. If you don't put any number, Jmol will show the full Van der Waals radius of the atoms. This is another way of viewing sidechains or, more commonly, ligands such as ions. For ions, typically no number is needed (the Van der Waals radius works just fine), but if you're using it for your sidechains, you most likely want something slightly larger than your wireframe (225-250). To remove spacefill from the selection, use "spacefill off".


 * color cpk: Colors the selection according to the "CPK" color system, which has a specific color for each element. Carbon is gray, oxygen is red, nitrogen is blue, and sulfur is yellow (these are the most common ones you'll see). Note: if you have the entire protein in "backbone" mode, the entire protein will appear gray, because only the alpha-carbon backbone is being displayed.


 * color [R,G,B]: Colors the currently selected objects the RGB values specified (you put numbers from 0 to 255 in place of R, G, and B to specify what color you want). There's also the easy way: instead of putting in [R,G,B], just put in a color name. Jmol only knows some of them, though (for instance, it knows "yellow", but probably doesn't know "tangerine". It might, though, since it has a pretty extensive- and somewhat bizarre- color library).


 * color structure: Colors the alpha helices and beta sheets by type, so that helices and sheets each get their own color.


 * color group: Colors the protein along the chain in a rainbow gradient, from blue at the amino terminus to red at the carboxy terminus, which makes the two termini easy to find.
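The commands above are usually typed one at a time, but it can help to see a whole viewing sequence written out in order. The Python sketch below just assembles such a sequence as text that you could paste into the console line by line. Treat it as one possible starting point and not an official recipe: the function name, the file name 2wbu.pdb, and the choice of cysteine and histidine sidechains are all assumptions for illustration, and "load" is a Jmol command that is not covered in the list above.

```python
def build_view_script(pdb_filename, chain, sidechain_residues):
    """Return Jmol commands (one per line) for a basic backbone-plus-sidechains view."""
    residue_choice = " or ".join(sidechain_residues)
    commands = [
        f"load {pdb_filename}",          # open the file (not covered in the list above)
        f"restrict *{chain}",            # show and select only this chain
        f"center *{chain}",              # keep the chain in view while rotating
        "wireframe off",                 # clear the default atom display...
        "spacefill off",
        "backbone 300",                  # ...and show just the backbone
        "color structure",               # color helices and sheets
        f"select *{chain} and ({residue_choice}) and sidechain",
        "wireframe 200",                 # draw the chosen sidechains
        "color cpk",
        "select hetero and not hoh",     # ligands and ions, but no water
        "spacefill",
        "color cpk",
    ]
    return "\n".join(commands) + "\n"

script = build_view_script("2wbu.pdb", "a", ["cys", "his"])
print(script)
```

Note the parentheses around the residue list: without them the "and" operators would bind before the "or", and the selection would not be limited to one chain.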

Boolean Operators
Boolean operators are connecting words like "and" and "or", which determine how Jmol combines the different groups you're telling it to act on.


 * and: If you use "and" to connect two things, Jmol will act on only things that fit both criteria. For example, "select his and sidechain" selects only the sidechains of all the histidine residues. You can also string together "and"s, and Jmol will act only on things that fit all the criteria: "restrict *a and 1-40 and cys" shows and selects only cysteine residues in the interval from 1-40 of chain A.


 * or: If you use "or" to connect two things, Jmol will act on anything that fits either of your criteria. For example, "select cys or his" will select all cysteine and all histidine residues. Note: "or" is often what you want when intuitively you'd think "and": to select cysteines AND histidines, you type "select cys or his". "select cys and his" selects nothing, because there are no residues that are both cysteine and histidine.


 * not: This tells Jmol to act on everything that doesn't fit some given criteria. For example, "select not dna" selects everything in the file that isn't DNA. It can be combined with other operators, like "select hetero and not hoh", which selects all non-protein, non-DNA matter that also isn't water.


 * xor: If you use "xor" to connect two things, Jmol will act on anything that fits exactly one of your criteria, but not anything that fits both.
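If the operators still feel abstract, it may help to think of each criterion as a set of residues. Python's set operators behave the same way, so here is a small sketch using an invented five-residue structure (this is an analogy, not Jmol itself): "and" is intersection (&), "or" is union (|), "not" is everything minus the set, and "xor" is symmetric difference (^).

```python
# Each residue is (chain, number, name); this tiny structure is made up.
residues = {
    ("a", 1, "met"), ("a", 2, "cys"), ("a", 3, "his"),
    ("b", 1, "cys"), ("b", 2, "gly"),
}

chain_a = {r for r in residues if r[0] == "a"}
cys = {r for r in residues if r[2] == "cys"}
his = {r for r in residues if r[2] == "his"}

cys_and_his = cys & his          # like "select cys and his": nothing is both
cys_or_his = cys | his           # like "select cys or his"
not_chain_a = residues - chain_a # like "select not *a"
a_xor_cys = chain_a ^ cys        # like "select *a xor cys"

print(sorted(cys_and_his))   # []
print(sorted(cys_or_his))    # [('a', 2, 'cys'), ('a', 3, 'his'), ('b', 1, 'cys')]
print(sorted(not_chain_a))   # [('b', 1, 'cys'), ('b', 2, 'gly')]
print(sorted(a_xor_cys))     # [('a', 1, 'met'), ('a', 3, 'his'), ('b', 1, 'cys')]
```

In the last line, the chain A cysteine drops out because it fits both criteria, while the chain B cysteine stays because it fits only one.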

Combining Operators
Parentheses can be used to combine different Boolean operators in different ways. If all your operators are the same (e.g., "*a and 1-40 and his" or "his or cys or gly"), you don't need any parentheses; however, if you have a combination of "and"s and "or"s or want a "not" to apply to some combination of criteria, you need parentheses to tell Jmol which operators should be grouped together.

For example, "select (*a or *b) and his" will select all histidines in chain A and all histidines in chain B, whereas "select *a or (*b and his)" will select ALL of chain A and all histidines in chain B. If you leave out all parentheses, "and" takes precedence over "or", so Jmol will assume the latter case.

It can also be confusing figuring out what your "not" applies to. If there are no parentheses present, it modifies only the object immediately following it (e.g., "select not *a or his" selects everything that isn't part of chain A, as well as all histidines, including those in chain A- one of the criteria is "not *a" and one is "his", and everything that fits at least one or the other gets selected). To have your "not" apply to more than one criterion, use parentheses after the not; for example, "select not (*a or his)" will select everything except chain A that also isn't a histidine. It has the same effect as "select not *a and not his".
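Python's own "and", "or" and "not" follow the same grouping rules described above ("not" binds tightest, then "and", then "or"), so you can use it to sanity-check a tricky selection before typing it into Jmol. The sketch below is only an analogy on an invented four-residue structure; the helper names are hypothetical.

```python
# Each residue is (chain, number, name); this tiny structure is made up.
RESIDUES = [("a", 1, "his"), ("a", 2, "gly"), ("b", 1, "his"), ("b", 2, "cys")]

def select(test):
    """Return the residues for which the test is true, in file order."""
    return [r for r in RESIDUES if test(r)]

def in_chain(r, chain):
    return r[0] == chain

def is_his(r):
    return r[2] == "his"

# "(*a or *b) and his" versus "*a or *b and his" (no parentheses)
grouped = select(lambda r: (in_chain(r, "a") or in_chain(r, "b")) and is_his(r))
ungrouped = select(lambda r: in_chain(r, "a") or in_chain(r, "b") and is_his(r))

# "not *a or his" versus "not (*a or his)" versus "not *a and not his"
not_first_only = select(lambda r: not in_chain(r, "a") or is_his(r))
not_whole_group = select(lambda r: not (in_chain(r, "a") or is_his(r)))
split_nots = select(lambda r: not in_chain(r, "a") and not is_his(r))

print(grouped)          # [('a', 1, 'his'), ('b', 1, 'his')]
print(ungrouped)        # [('a', 1, 'his'), ('a', 2, 'gly'), ('b', 1, 'his')]
print(not_first_only)   # [('a', 1, 'his'), ('b', 1, 'his'), ('b', 2, 'cys')]
print(not_whole_group)  # [('b', 2, 'cys')]
print(split_nots)       # [('b', 2, 'cys')]
```

The last two results match, which is the same equivalence the text points out between "not (*a or his)" and "not *a and not his".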

Helpful Tips
IN PROGRESS

Folding the Mini-Toober Model
Your Mini-Toober model will be based on your Jmol model, but unfortunately, the only way to convert the latter into the former is to use the Jmol image as a sort of "map" showing how to fold the toober.

Fortunately, CBM has an excellent step-by-step video tutorial of how to fold the Mini-Toober based on what you have in Jmol. This should answer any questions you have about creating your Mini-Toober model.

Useful Links

 * Mini-Toober folding tutorial (mentioned above)
 * Protein Data Bank
 * Zinc fingers Molecule of the Month article
 * General protein info (sadly, this link appears to be broken. I can't find the original site anymore, although there are links to it all over different protein sites. None of them have the same information, though)