Difference between revisions of "Protein Modeling"
Revision as of 05:31, 17 January 2021
Protein Modeling is a Division C event for the 2021 season. It was introduced as a trial event in 2009 and was an official national event in the 2010-2011, 2011-2012, 2014-2015, 2015-2016, 2018-2019, and 2019-2020 seasons. For the event, students use computer visualization and online resources to guide the construction of physical models of proteins and to understand how protein structure determines function. In 2016, students modeled proteins involved with dopamine and serotonin. In 2019, students studied CRISPR-Cas9 and one of its inhibitors, AcrIIA4. In 2020 and 2021, students are studying the APOBEC3A cytidine deaminase and how it is used in direct base editing.
- 1 Previous Topics
- 2 Event Overview
- 3 Protein Structure
- 4 Using Jmol
- 5 Folding the Mini-Toober Model
- 6 Useful Links
Previous Topics
|Year||Topic|
|2011||Pluripotent Stem Cells|
|2016||Dopamine and Serotonin|
|2019||CRISPR and Anti-CRISPR|
|2020||CRISPR and Base Editing|
In past years, competitors would build 2 models - one prebuild model and one onsite build. However, in 2020 this was changed and instead of an onsite model, students will explore any given protein using Jmol and answer questions regarding its structure. The prebuild still remains part of the competition, as well as a written test on a specific topic.
For the 2021 season scoring is based on the following components:
- Prebuild model: 40%
- Jmol Exploration: 30%
- Written Test: 30%
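As a rough illustration of how the weights above combine, the following minimal sketch computes a weighted total out of 100. It assumes each component is first expressed as a fraction of its own maximum; how a given tournament actually normalizes raw scores is not stated here, and the function and dictionary names are hypothetical.

```python
# Component weights for the 2021 season, taken from the list above.
WEIGHTS = {"prebuild": 0.40, "jmol": 0.30, "written": 0.30}


def overall_score(fractions):
    """Combine per-component fractions (0.0 to 1.0) into a weighted total out of 100.

    Assumption: each component has already been scaled to a fraction of its maximum.
    """
    missing = set(WEIGHTS) - set(fractions)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return 100 * sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)


if __name__ == "__main__":
    # A team earning 90% on the prebuild, 80% on Jmol, and 70% on the test.
    print(round(overall_score({"prebuild": 0.90, "jmol": 0.80, "written": 0.70}), 1))
```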
In the 2016-2017 season, five two-sided 8.5x11 sheets of reference materials from any source, typed, handwritten, or graphics, were allowed, along with a scientific calculator and writing utensils (a Sharpie or other permanent marker is recommended for marking the Mini-Toober for the on-site model, as well as normal pens/pencils for taking the test).
For the 2018-2019 season, only one two-sided, letter-size "cheatsheet" is allowed. Calculators are not allowed this year. In addition to markers, pens, and pencils, teams are also encouraged to bring a metric ruler to aid in building the on-site model to the 2 cm scale.
For 2020 and 2021, each participant may bring a sheet of paper - if there are three individuals competing in the event, each one may have a sheet of information. However, calculators are still not allowed.
For the pre-build model, a section of a certain protein will be specified from the Protein Data Bank. Students purchase beforehand a Mini-Toober of the correct length, at a scale of 2 cm per amino acid residue, and red and blue plastic end-caps representing the carboxyl and amino termini of the protein chain respectively. The model must be correctly folded, and the end-caps placed at the correct ends of the protein chain, but the model must also include "creative additions" that showcase the function of the selected protein. Students must decide for themselves which sidechains and/or ligands are important to the protein's function and should be displayed. These additions must be explained on a two-sided, 3x5 index card, along with a paragraph explanation of how they are relevant to the protein's function, and submitted with the model, typically at impound.
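The 2 cm per residue scale makes it easy to work out how much Mini-Toober is needed and where to mark each residue. The sketch below is one way to do that arithmetic; the residue range 41-60 is only an example, and measuring to the midpoint of each residue's 2 cm segment is an assumption of this sketch rather than a rule.

```python
CM_PER_RESIDUE = 2  # scale stated in the rules description above


def toober_length_cm(first_residue, last_residue):
    """Length of Mini-Toober needed for residues first..last, inclusive."""
    if last_residue < first_residue:
        raise ValueError("last_residue must be >= first_residue")
    n_residues = last_residue - first_residue + 1  # inclusive range
    return n_residues * CM_PER_RESIDUE


def mark_position_cm(residue, first_residue):
    """Distance from the amino-terminal (blue) end to the middle of a residue.

    Assumption: each residue occupies one 2 cm segment and is marked at its midpoint.
    """
    if residue < first_residue:
        raise ValueError("residue lies before the start of the modeled section")
    return (residue - first_residue + 0.5) * CM_PER_RESIDUE


if __name__ == "__main__":
    print(toober_length_cm(41, 60))   # residues 41 through 60
    print(mark_position_cm(50, 41))   # where to mark residue 50
```

Measuring from the blue (amino-terminal) cap keeps the marks in the same direction as the residue numbering.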
The pre-build model is scored based on:
- Accuracy and correct placement of secondary structures (alpha helices and beta sheets)
- Accuracy of tertiary structure (3D arrangement of secondary structures- e.g., two strands of a beta sheet being correctly placed next to each other, and a nearby helix being perpendicular to both)
- Relevance and correct placement of sidechains or ligands displayed on the model (and clarity/accuracy of the explanation on the card), as well as correct placement of end-caps
- Relevance and creativity of "creative additions" (e.g., labeling the "fusion peptide" on hemagglutinin would receive points, but displaying every sidechain wouldn't, since that doesn't show any understanding of the protein's function)
For an example of how specific the pre-build model has to be (and of what constitutes good "creative additions"), see the 2010 National Tournament Pre-build Rubric and the 2011 National Tournament Pre-build Rubric.
For the on-site model, a file would be specified from the Protein Data Bank, but the specific section of the protein that students would have to build would not be stated until the competition. At the competition, students would be given a Mini-Toober of the correct length, at the same scale of 2 cm per amino acid residue, foam representations of important sidechains or associated ligands, and red and blue plastic end-caps representing the carboxyl and amino termini of the protein chain respectively. The model had to be correctly folded, with the end-caps placed at the correct ends of the protein chain and the given sidechains or ligands attached to the Mini-Toober at the correct locations and with the correct orientations.
The on-site model was scored based on:
- Accuracy and correct placement of secondary structures (alpha helices and beta sheets)
- Accuracy of tertiary structure (3D arrangement of secondary structures- e.g., two strands of a beta sheet being correctly placed next to each other, and a nearby helix being perpendicular to both)
- Correct placement and orientation of given sidechains and ligands, as well as correct placement of end-caps
On-Site Jmol Exploration
For the on-site Jmol exploration, a file will be specified from the Protein Data Bank, and students will be expected to use the Jmol viewing environment to answer questions about its structure. The protein(s) selected may or may not be related to the current topic. Students may also be asked about the protein's function.
A student may be asked:
- How many sulfur atoms are in this protein?
- Where are the polar and nonpolar sidechains located in this protein?
- How many polypeptide chains does this protein contain?
- What is the function of this protein?
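When practicing, it can help to check Jmol answers to questions like these against the PDB file itself. The following is a minimal sketch that reads the fixed-width ATOM and HETATM lines of PDB-format text (chain ID in column 22, element symbol in columns 77-78) and counts sulfur atoms and chains. The helper that builds toy lines and the sample data are made up for illustration; a real file may contain multiple models, alternate locations, or nucleic acid ATOM records, which this sketch does not handle.

```python
def atom_line(record, serial, name, resname, chain, resseq, element):
    """Build one fixed-width PDB-style coordinate line (toy data only)."""
    return (
        f"{record:<6}{serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}"
        + " " * 4
        + f"{0.0:>8.3f}{0.0:>8.3f}{0.0:>8.3f}{1.0:>6.2f}{0.0:>6.2f}"
        + " " * 10
        + f"{element:>2}"
    )


def summarize(pdb_text):
    """Count sulfur atoms and list chain IDs found in PDB-format text."""
    protein_sulfur = 0
    hetero_sulfur = 0
    chains = set()
    for line in pdb_text.splitlines():
        record = line[0:6].strip()
        if record not in ("ATOM", "HETATM") or len(line) < 22:
            continue
        element = line[76:78].strip().upper()  # columns 77-78
        if record == "ATOM":
            chains.add(line[21])               # column 22
            if element == "S":
                protein_sulfur += 1
        elif element == "S":
            hetero_sulfur += 1
    return {
        "protein_sulfur": protein_sulfur,
        "hetero_sulfur": hetero_sulfur,
        "chains": sorted(chains),
    }


# Hypothetical sample: two chains, three protein sulfurs, one sulfate ligand.
SAMPLE = "\n".join([
    atom_line("ATOM", 1, "CA", "ALA", "A", 1, "C"),
    atom_line("ATOM", 2, "SG", "CYS", "A", 2, "S"),
    atom_line("ATOM", 3, "SD", "MET", "A", 3, "S"),
    atom_line("ATOM", 4, "SG", "CYS", "B", 1, "S"),
    atom_line("HETATM", 5, "S", "SO4", "A", 101, "S"),
    atom_line("HETATM", 6, "O", "HOH", "A", 201, "O"),
])

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Keeping protein and hetero sulfur separate matters because a question about sulfur "in this protein" may or may not be meant to include ligands such as sulfate.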
The written test consists of multiple choice and short answer questions about general protein topics (e.g., "What is the full name of the N-terminus of a protein chain?"), information specific to the selected protein (e.g., for the 2009-10 hemagglutinin, "Why are pigs important to the spread of influenza viruses?"), and information extremely specific to the PDB file (e.g., "What is the resolution in Angstroms of the PDB file 1HTM.pdb?").
There are relatively few questions on the test- for example, the 2010 National test consisted of 10 multiple choice and 5 short answer questions. Each multiple choice question was worth one point, and the short answer questions were worth 20 points total (each question was assigned its own point value, which was printed on the test). A similar number of questions and mode of scoring can be expected on future tests. Some of the short answer questions will be selected as tiebreaker questions, and will be designated as such. These still count normally toward the regular score on the test portion, but will be weighted more in the event of a tie.
See the Test Exchange for exams and answer keys from the 2010 and 2011 National Tournaments.
These attributes are consistent across all levels of competition, because the Center for Biomolecular Modeling at MSOE provides all tests for the Protein Modeling event. However, the questions (at least ostensibly) get more challenging at each successive level of competition.
"Structure equals function" is the basic tenet of Protein Modeling: i.e., it's important to know what a protein's structure is like because its function is determined by its structure.
There are four different types of protein structure: primary, secondary, tertiary, and quaternary.
Primary structure is the sequence of amino acid residues in a protein chain. When amino acids join to form a protein, each peptide bond forms by dehydration synthesis: a hydrogen is lost from the amino group of one amino acid and a hydroxyl group (-OH) from the carboxylic acid group of the other, and the two leave together as a molecule of water. The amino acids in a chain are called residues because what remains after the water leaves is a dehydrated version, or residue, of each amino acid. There are 20 main varieties of amino acid, which differ only in their sidechain (sometimes called an "R group"). Each amino acid is known by three names - the full name, usually derived from what the amino acid was first isolated from; the three-letter abbreviation; and a one-letter abbreviation. For example, tyrosine (from Greek tyros for cheese) is also known as Tyr and Y.
The central carbon to which the carboxylic acid, amine, and R-group are bound serves as a chiral center (except in glycine, whose sidechain is a single hydrogen), meaning that amino acids can exist in two different forms, the L-enantiomer and the D-enantiomer. The L-enantiomer is by far the most prevalent form in nature. It is important to remember that since enantiomers are not superimposable, D-enantiomers cannot substitute for L-enantiomers in function.
Amino acids can exist as zwitterions, which are species with no net charge that carry both a positive and a negative charge. In solution, the proton on the carboxylic acid end will generally dissociate and the amine end will be protonated, creating a negative charge on the newly formed carboxylate end and a positive charge on the amine end. This protonation and deprotonation is pH dependent, so the charges of amino acids can change over a pH range, which is one of the reasons why altering the pH around a protein can affect its activity.
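The pH dependence described above can be estimated with the Henderson-Hasselbalch relationship. The sketch below does this for free glycine only, using approximate textbook pKa values (about 2.34 for the carboxylic acid and about 9.60 for the amine); these numbers and the function names are not from this page, and sidechain groups and the environment inside a real protein are ignored.

```python
# Approximate textbook pKa values for free glycine (assumed, not from this page).
PKA_CARBOXYL = 2.34
PKA_AMINE = 9.60


def fraction_deprotonated(ph, pka):
    """Henderson-Hasselbalch: fraction of a group that has lost its proton."""
    return 1.0 / (1.0 + 10 ** (pka - ph))


def glycine_net_charge(ph):
    """Average net charge of free glycine at a given pH."""
    negative = fraction_deprotonated(ph, PKA_CARBOXYL)    # -COO(-) fraction
    positive = 1.0 - fraction_deprotonated(ph, PKA_AMINE)  # -NH3(+) fraction
    return positive - negative


if __name__ == "__main__":
    for ph in (1.0, 5.97, 7.0, 12.0):
        print(ph, round(glycine_net_charge(ph), 3))
```

The net charge is positive at very low pH, close to zero near the midpoint of the two pKa values (the zwitterion dominates), and negative at very high pH.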
Different residue sidechains have different properties. One of the most important properties of an amino acid is its polarity, which is highly dependent on the sidechain. Polar functional groups (hydroxyls, amines, sulfhydryls) as well as charged functional groups increase the polarity of an amino acid, while the lack of these functional groups decreases polarity. Polarity is a major factor in determining hydrophobicity. Water is highly polar and will interact via intermolecular attractions with other polar molecules, so polar amino acids are hydrophilic. In contrast, water does not interact much with nonpolar molecules, so nonpolar amino acids are generally hydrophobic.
The charge of a functional group also affects the properties of an amino acid. For example, the red sidechains in the diagram are negatively charged, and the blue ones are positively charged. Charges are determined by which of the three forms of the amino acid is most prevalent at physiological pH. Recall the definition of a Brønsted-Lowry acid/base pair - acidic species donate protons to water to form hydronium, whereas basic species accept protons from water to form hydroxide. An amino acid whose sidechain is negatively charged at neutral pH has donated a proton, so such amino acids are commonly referred to as "acidic". Similarly, amino acids whose sidechains are positively charged have accepted a proton and are known as "basic". These properties determine how the protein folds (i.e., the secondary and tertiary structure), because certain types of residues attract, repel, or bond to other types of residues. Also, the types of residues present can determine how the protein interacts with other molecules such as DNA- for example, serine can form hydrogen bonds, and therefore is often found at binding sites in a protein.
Charged sidechains repel like charges and attract opposite charges. Hydrophilic sidechains usually end up on the outside of a folded structure, because most proteins fold in an aqueous environment and the polar sidechains interact well with water. For the same reason, hydrophobic sidechains usually end up on the inside of the structure, because they do not interact well with water. In addition, hydrophobic sidechains often interact with other hydrophobic sidechains in what is termed hydrophobic interaction.
Cysteine, which is shown in green in the diagram, is a special amino acid. When two cysteine residues are oxidized, a disulfide bridge is formed between their side chains, forming a molecule named cystine. Disulfide bridges help many proteins maintain their structure by covalent bonding.
Each residue in a chain is given a number, starting at the amino terminus (that is, the end that has an amino group still present) with the lowest number (which is not always 1, depending on the numbering conventions for the particular family of proteins) and going up to the carboxy terminus (the end that has a carboxyl group still present).
Secondary structure is the first level of folding in a protein. Patterns called "motifs", such as alpha helices and beta sheets (by far the two most common), are caused by hydrogen bonding between the backbone N-H and C=O groups of the residues. The backbone runs through the central carbon of each amino acid, also known as the alpha carbon.
Alpha helices are more common in proteins than beta sheets. These helices are tightly coiled single strands, kept in place by hydrogen bonds between nearby residues - every backbone N-H group hydrogen bonds to the C=O group of the amino acid four residues earlier in the strand. On average, there are 3.6 residues per turn. Alpha helices can be anywhere from only a few residues in length to over 100 Angstroms long in some proteins. They tend to be the base of protein "stalks" (such as that of 2009-10's influenza hemagglutinin). The backbone hydrogen bonds of alpha helices are weaker than the hydrogen bonds in beta sheets - however, alpha helices are more stable and more resistant to mutation.
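The helix figures above can be turned into quick estimates. This is a minimal sketch that uses the 3.6 residues per turn given above together with a rise of about 1.5 Angstroms per residue along the helix axis; the rise is a standard textbook value assumed here, not something stated on this page, and real helices deviate from these averages.

```python
import math

RESIDUES_PER_TURN = 3.6       # average stated above
RISE_PER_RESIDUE_A = 1.5      # Angstroms per residue; textbook value (assumed)


def helix_turns(n_residues):
    """Approximate number of turns in an ideal alpha helix."""
    return n_residues / RESIDUES_PER_TURN


def helix_length_angstroms(n_residues):
    """Approximate length along the helix axis, in Angstroms."""
    return n_residues * RISE_PER_RESIDUE_A


def residues_for_length(length_angstroms):
    """Smallest whole number of residues that spans the given axial length."""
    return math.ceil(length_angstroms / RISE_PER_RESIDUE_A)


if __name__ == "__main__":
    print(round(helix_turns(18), 2))        # turns in an 18-residue helix
    print(helix_length_angstroms(18))       # its approximate length
    print(residues_for_length(100))         # residues in a 100 Angstrom helix
```

Under these assumptions, a helix 100 Angstroms long corresponds to roughly 67 residues, which gives a sense of scale when planning the turns in a Mini-Toober helix.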
Beta sheets, on the other hand, are made up of many beta strands- kinked sequences of residues separated by loops. These strands line up next to each other. Most beta sheets are antiparallel, meaning that the C-terminus of one strand is adjacent to the N-terminus of the next strand. Antiparallel beta sheets are the most stable because this arrangement allows the inter-strand hydrogen bonds to be planar. However, some beta sheets are parallel, meaning that the beta strands in the sheet all have the same directionality, which makes them less stable because the hydrogen bonds are nonplanar. Some sheets have a mixed bonding pattern, with some strands being parallel and others being antiparallel; this structure is most likely less stable than a purely antiparallel beta sheet.
Certain amino acids are more likely to appear in the middle or at the end of a beta sheet. Large aromatic residues and beta-branched amino acids are more likely to be found near the middle of beta sheets - these are tyrosine, phenylalanine, tryptophan, threonine, valine, and isoleucine. Proline is often found in the edge strands of beta sheets.
Beta sheets are very strong as protective or support layers (such as the "beta-barrel" exterior of GFP).
Tertiary structure is the position in three dimensions of the secondary structures (motifs). It is determined by the secondary structures present, as well as the properties of the sidechains. Hydrophilic sidechains such as glutamine will move to the "outside" when the protein is folded in a watery environment, while hydrophobic sidechains such as tryptophan will cluster "inside" the protein, protected by other sections of the protein, to prevent their exposure to water. Oppositely charged sidechains come together, forming salt bridges (ionic bonds), while sidechains with the same charge repel each other. Cysteine, which contains sulfur, bonds covalently with other cysteines to form strong disulfide bonds. The interaction of all these attractions and repulsions causes the protein to develop a unique shape in 3D, called a "conformation".
The protein's tertiary structure also depends on the environment in which it is folded: in the human body, which is a watery environment, the hydrophobic (nonpolar) sidechains end up on the inside, as stated above. However, in a protein folded in a hydrophobic environment (such as a protein embedded in a phospholipid cell membrane), the hydrophilic (polar) sidechains end up on the inside.
Quaternary structure is the arrangement of each of the individual pieces (monomers) of a multi-unit (multimeric) protein. These subunits, or "chains" as they are often called, each have their own amino and carboxy terminus, and are not joined to each other through the peptide backbone. However, they are held together by bonds- which can be disulfide or ionic, although more commonly the latter- and arranged together in a specific conformation. Multimers are quite common, and may contain several distinct chains or simply several copies of the same one (or few).
Jmol is the software used to create computer models of the proteins for the Protein Modeling event.
Downloading Jmol and PDB Files
Jmol is open-source and can be downloaded here or used online. This is the pre-build online Jmol environment, which offers the same features as the online Jmol environment used for the on-site exploration. However, it only allows competitors to view the provided pre-build file. As a result, downloading Jmol to a personal computer is still advised. While it is important for competitors to familiarize themselves with the online interface, downloading Jmol allows competitors to explore any PDB file they want.
To download Jmol, go to the download page linked to above. This link provides more specific instructions and the following information is simply a summary. Please note that using Jmol on a computer requires that computer to have Java installed - see Oracle's website for further instructions.
Click the direct download link at the top of the page. This should lead to a page on Sourceforge.net that will automatically download the current version of Jmol. At the time of writing, this file should be named "Jmol-14.31.20-binary.zip". Extract the zip folder, which should contain a variety of different files and folders. The most important one is Jmol.jar. Place this file in a folder such as C:\Users\Documents\Jmol or something similar. Avoid including spaces in the name of the folder - Jmol may have issues saving and retrieving files if there are spaces in the file path. Any other files in the initial .zip folder can be deleted, as the only necessary one is Jmol.jar. Any PDB files or scripts should also be saved to this Jmol folder, as Jmol may have difficulties opening files that are not in the same folder as the main program.
[Any competitors using Linux who are familiar with the command line can also download the .tar.gz file, use "tar -xzvf" to extract, and then take Jmol.jar from the resulting directory.]
One of the main benefits to downloading Jmol is having the ability to open any desired PDB file. In order to view a desired protein, the PDB file for that protein should be present on the computer with Jmol installed. To download a PDB file, go to its structure summary page (linked to in the Year-Specific Information section) and click the "Download Files" option in the upper-right corner. A drop-down menu will appear, as shown to the right. Click on PDB Format, and save the file in the same folder as Jmol.jar.
There are two ways to enter commands in Jmol- through the command prompt or by selecting options from the drop-down menus. Although people who have never done any programming or used a command line setup before might be more comfortable with the menus, the command prompt is much easier and more versatile. The command prompt is also the only way to use the online Jmol environment. Since every competitor must use the online environment in competition, it is necessary to know how to use the command line. To navigate to the command prompt on the downloaded version of Jmol, go to the File menu and click "Output Console". Some useful commands are listed in the section below.
List of Commands
These are some of the most common commands used while modeling a protein in Jmol:
- select (object)
- Selects the specified objects, making them the subject of future commands (e.g., "select *a" would select chain A of the protein, since * is the symbol for chain).
- restrict (object)
- Selects the specified objects and makes everything else disappear (e.g., "restrict *a" would leave a screen showing only chain A, which would also be selected).
- center (object)
- Centers the field of view on the specified objects. This is an incredibly useful command after a "restrict", because otherwise, the protein will easily swing out of the field of view when it is rotated.
- undo
- Undoes the last command.
- redo
- Redoes the last undone command.
- write [scriptname].spt
- Saves the current display state as a file called "[scriptname].spt" in the same folder where Jmol is. This can be useful for creating a file that shows all the sidechains that are being modeled for the prebuild, for example. It can then be loaded by navigating to File -> Open and selecting the script.
Objects that can be selected include:
- chains
- As mentioned above, the symbol for a chain is *. It should work as *[letter] or * [letter] (i.e., with or without the space).
- specific residues
- Residues can be specified using their residue numbers- individual or a range- or three-letter codes. For example, "select 41" will select residue 41, "select 41-60" will select all residues from 41 to 60 inclusive, and "select his" will select ALL histidines. If selecting all histidines between residues 41 and 60 only, it is possible to string together ranges and residue types using Boolean operators (see below).
- DNA
- DNA works the same way as proteins do in Jmol, in that specific parts like the backbone or nucleotides can be selected. However, the entire molecule can be selected as a whole with "select dna".
- non-protein stuff
- Everything that isn't protein (or DNA) is "hetero", as in "select hetero". This includes any ligands (such as ions or small molecules) associated with the protein, as well as water (which has its own designation as well- "hoh"). Because of the method by which protein structure data is obtained, all the surrounding water molecules are also captured. They can be quite distracting, so removing or not displaying the water might be necessary ("select hetero and not hoh"- see below for more information about Boolean operators).
- sidechains
- When viewing a given residue, it may not be optimal to view the entire residue including the amino and carboxy groups (or what's left of them); in most cases, a competitor would only want to see the sidechain sticking out from the alpha-carbon backbone. Use "select sidechain" (which will select all sidechains of all residues) or "select [residue] and sidechain" (see below for more information about Boolean operators) to select just the sidechain of a particular residue.
- alpha helices or beta sheets
- When folding the Mini-Toober model, it can be helpful to see where the helices and sheets are. Alpha helices can be selected with "select helix" and beta sheets can be selected with "select sheet".
Once something is selected, the way it is displayed can be altered in various ways.
- backbone [number]
- Displays the selection in backbone mode- where only the alpha carbons of the amino acid residues are shown- with the number being the radius. This is the default display mode -- the Mini-Toober represents this part of the protein. A radius of about 300 works well.
- wireframe [number]
- Displays the selection in wireframe mode- where the connections between atoms (not just alpha carbons) are shown as lines/cylinders- with the number being the radius of the cylinders. This display mode is the most common way of viewing sidechains; a radius of about 200 works well. To remove wireframe from the selection, use "wireframe off".
- spacefill [number]
- Displays all atoms in the selection as spheres, the number being the radius of the spheres. If no number is included, Jmol will show the full Van der Waals radius of the atoms. This is another way of viewing sidechains or, more commonly, ligands such as ions. For ions, typically no number is needed (the Van der Waals radius works just fine), but for sidechains it may be advantageous to make it slightly larger than the wireframe (225-250). To remove spacefill from the selection, use "spacefill off".
- color cpk
- Colors the selection according to the "CPK" color system, which has a specific color for each element. Carbon is gray, oxygen is red, nitrogen is blue, and sulfur is yellow (these are the most common colors seen). Note: if the entire protein is in "backbone" mode, the entire protein will appear gray, because only the alpha-carbon backbone is being displayed.
- color [R,G,B]
- Colors the currently selected objects the RGB values specified (include numbers in place of R, G, and B to specify the color). There's also the easy way: instead of putting in [R,G,B], just put in a color name. Jmol only knows some of them, though (for instance, it knows "yellow", but probably doesn't know "tangerine". It might, though, since it has a pretty extensive- and somewhat bizarre- color library).
- color structure
- Colors the alpha-helices (pinkish-red) and beta pleated sheets (yellow). DNA is purple under this coloring system, and everything that has no specific secondary structure is white. Depending on the version of Jmol used, 3/10 helices may be the same color as alpha-helices, or may be a more purplish color.
- color group
- Colors the protein to display the amino and carboxy termini. The amino terminus is colored blue, and the carboxy terminus is red. Everything in between is displayed in rainbow gradations, so the part shown in green is closer to the amino terminus, while the part shown in orange is closer to the carboxy terminus.
- color chain
- Assigns a different color to each chain of the protein. This is useful mainly for determining how many chains are present in a given PDB file, or where one chain ends and another begins if they are closely associated.
Boolean operators are connecting words like "and" and "or", which determine how Jmol combines the different groups it is supposed to be acting on.
- When using "and" to connect two things, Jmol will act on only things that fit both criteria. For example, "select his and sidechain" selects only the sidechains of all the histidine residues. "And"s can also be strung together, and Jmol will act only on things that fit all the criteria: "restrict *a and 1-40 and cys" selects only cysteine residues in the interval from 1-40 of chain A.
- When using "or" to connect two things, Jmol will act on anything that fits either of the criteria. For example, "select cys or his" will select all cysteine and all histidine residues. Note: because of the way that boolean operators function in Jmol, it may be better to use "or" instead of "and" in certain scenarios. To select both cysteines AND histidines, type "select cys or his". "select cys and his" does nothing, because there are no residues that are both cysteine and histidine.
- Using "not" tells Jmol to act on everything that doesn't fit some given criteria. For example, "select not dna" selects everything in the file that isn't DNA. It can be combined with other operators, like "select hetero and not hoh", which selects all non-protein, non-DNA matter that also isn't water.
- When using "xor" to connect two things, Jmol will act on anything that fits only one of the criteria.
Parentheses can be used to combine different Boolean operators in different ways. If all operators are the same (e.g., "*a and 1-40 and his" or "his or cys or gly"), no parentheses are required; however, if there is a combination of "and"s and "or"s, or if a "not" should apply to some combination of criteria, parentheses are necessary to tell Jmol which operators should be grouped together.
For example, "select (*a or *b) and his" will select all histidines in chain A and all histidines in chain B, whereas "select *a or (*b and his)" will select ALL of chain A and all histidines in chain B. If no parentheses are included, "and" takes precedence over "or", so Jmol will assume the latter case.
It can also be confusing figuring out what the "not" applies to. If there are no parentheses present, it modifies only the object immediately following it (e.g., "select not *a or his" selects everything that isn't part of chain A, as well as all histidines, including those in chain A- one of the criteria is "not *a" and one is "his", and everything that fits at least one or the other gets selected). To have the "not" apply to more than one criterion, use parentheses after the not; for example, "select not (*a or his)" will select everything outside chain A that also isn't a histidine. It has the same effect as "select not *a and not his".
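The grouping rules above behave like ordinary set logic, which can be checked outside Jmol. The sketch below models a made-up seven-residue structure with Python sets, where & plays the role of "and", | plays the role of "or", and a small helper plays the role of "not". Python's & also binds more tightly than |, mirroring the precedence described above. The toy residues and helper names are hypothetical, and this is an analogy rather than Jmol's actual implementation.

```python
# Toy structure: (chain, residue number, residue name). Entirely made up.
RESIDUES = {
    ("A", 1, "HIS"), ("A", 2, "CYS"), ("A", 3, "GLY"),
    ("B", 1, "HIS"), ("B", 2, "CYS"),
    ("C", 1, "HIS"), ("C", 2, "GLY"),
}


def chain(letter):
    """Analogous to *a, *b, ... in the selections above."""
    return {r for r in RESIDUES if r[0] == letter.upper()}


def resname(code):
    """Analogous to selecting by three-letter code, e.g. his."""
    return {r for r in RESIDUES if r[2] == code.upper()}


def negate(selection):
    """Analogous to "not": everything in the structure outside the selection."""
    return RESIDUES - selection


a, b, his = chain("a"), chain("b"), resname("his")

grouped_or_first = (a | b) & his      # (*a or *b) and his
grouped_and_first = a | (b & his)     # *a or (*b and his)
no_parentheses = a | b & his          # "and" binds first, as described above

not_a_or_his = negate(a) | his        # not *a or his
not_grouped = negate(a | his)         # not (*a or his)
not_each = negate(a) & negate(his)    # not *a and not his

if __name__ == "__main__":
    print(sorted(grouped_or_first))
    print(sorted(grouped_and_first))
    print(not_grouped == not_each)
```

In this toy structure, "(*a or *b) and his" picks out two histidines, while "*a or (*b and his)" picks out all three residues of chain A plus one histidine from chain B.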
Folding the Mini-Toober Model
At competitions, for the on-site models, a kit will be provided consisting of Mini-Toobers, end caps, cross-linkers, and foam sidechains for the important residues that should be highlighted. For the pre-build, it is possible to buy a similar kit (minus the foam sidechains, as competitors must determine what to highlight). Alternately, the model can be made out of any other similar material (12-gauge dimensional house wire is given as an example in the rules, but anything that is sufficiently flexible but will hold its shape will suffice).
During competition and in the pre-build the Toober, end caps, and any sidechains that are given are all required parts of the model – their placement and presence count toward the final score.
The cross-linkers, however, are for stability purposes only, and do not have to represent any particular structure. Competitors do not have to use any or all of them, although during competition, it makes sense to use all the ones provided just to make the model as structurally sound as possible (the judges will be handling it frequently, and it may distort in the middle of grading if cross-linkers are not used). For the pre-build, they can just be used for stability (like they would be for an on-site build), they can represent particular bonds (as long as this is specified on the card), or the model can be stabilized in some other way.
The Mini-Toober model will be based on the Jmol model, but unfortunately, the only way to convert the latter into the former is to use the Jmol image as a sort of "map" showing how to fold the toober.
Fortunately, CBM has an excellent step-by-step video tutorial of how to fold the Mini-Toober based on what is displayed in Jmol. This should answer any questions about creating the Mini-Toober model.
- Mini-Toober folding tutorial (mentioned above)
- Protein Data Bank
- Zinc fingers Molecule of the Month article
- General protein info
- 404ic's Protein Modeling Guide
- Celestite's 2016 SSSS Notes
- Jmol Quick Reference Sheet