Protein Modeling

Revision as of 19:10, 21 December 2020 by Jaspattack (talk | contribs) (wew more cleanup)

Protein Modeling is a Division C event for the 2021 season. It was introduced as a trial event in 2009 and ran as an official national event in the 2010-2011, 2011-2012, 2014-2015, 2015-2016, 2018-2019, and 2019-2020 seasons. For the event, students use computer visualization and online resources to guide the construction of physical models of proteins and to understand how protein structure determines function. In 2016, students modeled proteins involved with dopamine and serotonin. In 2019, students studied CRISPR-Cas9 and one of its inhibitors, AcrIIA4. In 2020 and 2021, students are studying the APOBEC3A cytidine deaminase and how it is used in direct base editing.

Previous Topics

Season Topic
2011 Pluripotent Stem Cells
2012 Apoptosis
2015 Human Genome
2016 Dopamine and Serotonin
2019 CRISPR and Anti-CRISPR
2020 CRISPR and Base Editing

Event Overview

In past years, competitors would build 2 models - one prebuild model and one onsite build. However, in 2020 this was changed and instead of an onsite model, students will explore any given protein using Jmol and answer questions regarding its structure. The prebuild still remains part of the competition, as well as a written test on a specific topic.

For the 2021 season scoring is based on the following components:

  • Prebuild model: 40%
  • Jmol Exploration: 30%
  • Written Test: 30%
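
As a rough illustration of how these weights combine, the sketch below assumes each component is first expressed as a fraction of its available points. The function name, the normalization, and the sample numbers are hypothetical and are not taken from the rules.

```python
# Component weights for the 2021 season, as listed above.
WEIGHTS = {"prebuild": 0.40, "jmol": 0.30, "written": 0.30}


def overall_score(fractions):
    """Combine per-component fractions (0.0 to 1.0) into a percentage.

    `fractions` maps each component name to points earned divided by
    points available. This normalization is an assumption made for
    illustration; supervisors may tally scores differently.
    """
    return 100 * sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)


# Hypothetical team: 90% on the prebuild, 70% on Jmol, 80% on the test.
example = overall_score({"prebuild": 0.90, "jmol": 0.70, "written": 0.80})
print(round(example, 1))  # 81.0
```

Because the prebuild carries the largest weight, a given percentage gained there moves the total more than the same gain on either of the other two components.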

Allowed Materials

In the 2016-2017 season, five two-sided 8.5x11 sheets of reference materials from any source, typed, handwritten, or graphics, were allowed, along with a scientific calculator and writing utensils (a Sharpie or other permanent marker is recommended for marking the Mini-Toober for the on-site model, as well as normal pens/pencils for taking the test).

For the 2018-2019 season, only one two-sided letter-size "cheatsheet" is allowed. Calculators are not allowed this year. In addition to markers, pens and pencils, teams are also encouraged to bring a metric ruler to aid in building the onsite model to the 2 cm scale.

For 2020 and 2021, each participant may bring a sheet of paper - if there are three individuals competing in the event, each one may have a sheet of information. However, calculators are still not allowed.

Pre-Build Model

For the pre-build model, a section of a certain protein will be specified from the Protein Data Bank. Students purchase beforehand a Mini-Toober of the correct length, at a scale of 2 cm per amino acid residue, and red and blue plastic end-caps representing the carboxyl and amino termini of the protein chain respectively. The model must be correctly folded, and the end-caps placed at the correct ends of the protein chain, but the model must also include "creative additions" that showcase the function of the selected protein. Students must decide for themselves which sidechains and/or ligands are important to the protein's function and should be displayed. These additions must be explained on a two-sided, 3x5 index card, along with a paragraph explanation of how they are relevant to the protein's function, and submitted with the model, typically at impound.
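
The Mini-Toober length follows directly from the 2 cm per residue scale. A minimal sketch of that arithmetic, assuming the specified section is an inclusive residue range (the function name and the example range are hypothetical):

```python
CM_PER_RESIDUE = 2  # scale stated above: 2 cm per amino acid residue


def toober_length_cm(first_residue, last_residue):
    """Length of Mini-Toober needed for an inclusive residue range."""
    if last_residue < first_residue:
        raise ValueError("last_residue must not be smaller than first_residue")
    residue_count = last_residue - first_residue + 1  # both end residues are modeled
    return residue_count * CM_PER_RESIDUE


# Hypothetical section: residues 41 through 60, which is 20 residues.
print(toober_length_cm(41, 60))  # 40
```

Note the "+ 1": a section running from residue 41 to residue 60 contains 20 residues, not 19.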

The pre-build model is scored based on:

  • Accuracy and correct placement of secondary structures (alpha helices and beta sheets)
  • Accuracy of tertiary structure (3D arrangement of secondary structures- e.g., two strands of a beta sheet being correctly placed next to each other, and a nearby helix being perpendicular to both)
  • Relevance and correct placement of sidechains or ligands displayed on the model (and clarity/accuracy of the explanation on the card), as well as correct placement of end-caps
  • Relevance and creativity of "creative additions" (i.e., labeling the "fusion peptide" on hemagglutinin would receive points, but displaying every sidechain wouldn't, since that doesn't show any understanding of the protein's function)

For an example of how specific the pre-build model has to be (and of what constitutes good "creative additions"), see the 2010 National Tournament Pre-build Rubric and the 2011 National Tournament Pre-build Rubric.

On-Site Model

For the on-site model, a file would be specified from the Protein Data Bank, but the specific section of the protein that students would have to build would not be stated until the competition. At the competition, students would be given a Mini-Toober of the correct length, at the same scale of 2 cm per amino acid residue, foam representations of important sidechains or associated ligands, and red and blue plastic end-caps representing the carboxyl and amino termini of the protein chain respectively. The model had to be correctly folded, with the end-caps placed at the correct ends of the protein chain and the given sidechains or ligands attached to the Mini-Toober at the correct locations and with the correct orientations.

The on-site model was scored based on:

  • Accuracy and correct placement of secondary structures (alpha helices and beta sheets)
  • Accuracy of tertiary structure (3D arrangement of secondary structures- e.g., two strands of a beta sheet being correctly placed next to each other, and a nearby helix being perpendicular to both)
  • Correct placement and orientation of given sidechains and ligands, as well as correct placement of end-caps

For an example of what exactly supervisors looked for in the on-site model, see the 2010 National Tournament On-site Rubric and the 2011 National Tournament On-site Rubric.

On-Site Jmol Exploration

For the on-site Jmol exploration, a file will be specified from the Protein Data Bank, and students will be expected to use the Jmol viewing environment to answer questions about its structure. The protein(s) selected may or may not be related to the current topic. Students may also be asked about the protein's function.

Sample Questions

A student may be asked:

  • How many sulfur atoms are in this protein?
  • Where are the polar and nonpolar sidechains located in this protein?
  • How many polypeptide chains does this protein contain?
  • What is the function of this protein?
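
When practicing at home with a downloaded PDB file, answers to counting questions like the first and third can be cross-checked outside Jmol. The sketch below reads the fixed-width ATOM records of the PDB text format (chain identifier in column 22, element symbol in columns 77-78). It counts only ATOM records, so ligands and water in HETATM records are ignored, and it treats every chain ID as a polypeptide chain, which is a simplifying assumption (a file containing DNA would be overcounted). The helper `sample_atom` exists only to build made-up demonstration lines.

```python
def sample_atom(record, serial, name, res, chain, resseq, element):
    """Build one fixed-width PDB coordinate line for the demo (dummy coordinates)."""
    return (f"{record:<6}{serial:>5}  {name:<3} {res:>3} {chain}{resseq:>4}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}          {element:>2}")


def summarize_pdb(lines):
    """Return (sulfur atom count, sorted chain IDs) from ATOM records only."""
    sulfur_atoms = 0
    chains = set()
    for line in lines:
        if not line.startswith("ATOM"):
            continue  # skip headers, remarks, HETATM records, and so on
        chains.add(line[21])                    # column 22: chain identifier
        if line[76:78].strip().upper() == "S":  # columns 77-78: element symbol
            sulfur_atoms += 1
    return sulfur_atoms, sorted(chains)


# Made-up structure: one Met and one Cys in chain A, one Cys in chain B, one water.
demo = [
    "HEADER    HYPOTHETICAL DEMO STRUCTURE",
    sample_atom("ATOM", 1, "N", "MET", "A", 1, "N"),
    sample_atom("ATOM", 2, "SD", "MET", "A", 1, "S"),
    sample_atom("ATOM", 3, "SG", "CYS", "A", 2, "S"),
    sample_atom("ATOM", 4, "SG", "CYS", "B", 1, "S"),
    sample_atom("HETATM", 5, "O", "HOH", "A", 101, "O"),
]
print(summarize_pdb(demo))  # (3, ['A', 'B'])
```

In competition only the online Jmol environment is available, so this is a study aid for checking practice answers, not a replacement for knowing the Jmol commands.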

Written Test

The written test consists of multiple choice and short answer questions about both general protein topics (e.g., "What is the full name of the N-terminus of a protein chain?"), information specific to the selected protein (e.g., for the 2009-10 hemagglutinin, "Why are pigs important to the spread of influenza viruses?"), and information extremely specific to the PDB file (e.g., "What is the resolution in Angstroms of the PDB file 1HTM.pdb?").
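
File-specific facts such as the resolution are stored as plain text in the PDB file header. As a hedged sketch, the snippet below looks for the "REMARK   2 RESOLUTION." line used by the PDB text format. The demonstration header is invented and its 2.50 value is not the resolution of 1HTM or of any real entry.

```python
import re

# Matches lines such as "REMARK   2 RESOLUTION.    2.50 ANGSTROMS."
RESOLUTION_PATTERN = re.compile(
    r"^REMARK\s+2\s+RESOLUTION\.\s+(\d+(?:\.\d+)?)\s+ANGSTROMS"
)


def read_resolution(lines):
    """Return the resolution in Angstroms, or None if no numeric value is given."""
    for line in lines:
        match = RESOLUTION_PATTERN.match(line)
        if match:
            return float(match.group(1))
    return None  # e.g. structures whose header says "NOT APPLICABLE"


# Invented header lines for demonstration only.
demo_header = [
    "HEADER    HYPOTHETICAL DEMO ENTRY",
    "REMARK   2",
    "REMARK   2 RESOLUTION.    2.50 ANGSTROMS.",
]
print(read_resolution(demo_header))  # 2.5
```

The same information appears on the structure summary page of the Protein Data Bank entry, so opening the file in a text editor and searching for "RESOLUTION" works just as well.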

There are relatively few questions on the test- for example, the 2010 National test consisted of 10 multiple choice and 5 short answer questions. Each multiple choice question was worth one point, and the short answer questions were worth 20 points total (each question was assigned its own point value, which was printed on the test). A similar number of questions and mode of scoring can be expected on future tests. Some of the short answer questions will be selected as tiebreaker questions, and will be designated as such. These still count normally toward the regular score on the test portion, but will be weighted more in the event of a tie.

See the Test Exchange for exams and answer keys from the 2010 and 2011 National Tournaments.

These attributes are consistent across all levels of competition, because the Center for Biomolecular Modeling at MSOE provides all tests for the Protein Modeling event. However, the questions (at least ostensibly) get more challenging at each successive level of competition.

Protein Structure

"Structure equals function" is the basic tenet of Protein Modeling: i.e., it's important to know what a protein's structure is like because its function is determined by its structure.

There are four different types of protein structure: primary, secondary, tertiary, and quaternary.

Primary Structure

Primary structure is the sequence of amino acid residues in a protein chain. When amino acids bond to form a protein, each peptide bond forms through dehydration synthesis: one amino acid loses a hydrogen from its amino group and the other loses a hydroxyl group from its carboxylic acid group, and the two leave together as a molecule of water. The amino acids in a chain are called residues because what remains of each one after this reaction is a dehydrated version, or residue, of the original amino acid. There are 20 main varieties of amino acid, which differ only in their sidechain (sometimes called an "R group"). Each amino acid is known by three names: the full name, usually derived from what the amino acid was first isolated from; the three-letter abbreviation; and the one-letter abbreviation. For example, tyrosine (from Greek tyros for cheese) is also known as Tyr and Y.
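
Since PDB files and Jmol use three-letter codes while sequences are usually written with one-letter codes, converting between the two is a common study task. A minimal sketch using the standard codes for the 20 main amino acids (the function name and the sample sequence are hypothetical):

```python
# Standard three-letter to one-letter codes for the 20 main amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def to_one_letter(residues):
    """Convert a list of three-letter codes (any capitalization) to a one-letter string."""
    return "".join(THREE_TO_ONE[name.upper()] for name in residues)


# Made-up four-residue sequence, written from amino to carboxy terminus.
print(to_one_letter(["Met", "Tyr", "Cys", "His"]))  # MYCH
```

Several one-letter codes are not the first letter of the name (for example K for lysine and W for tryptophan), which is why a lookup table is safer than guessing.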

The 20 major amino acid sidechains. For a PDF version: File:AASK Sidechain List.pdf

The central carbon to which the carboxylic acid, amine, and R-group are bound serves as a chiral center, meaning that amino acids can exist in two different forms, the L-enantiomer and the D-enantiomer (glycine, whose sidechain is a single hydrogen, is the exception). The L-enantiomer is by far the most prevalent form in nature. It is important to remember that since enantiomers are not superimposable, D-enantiomers cannot substitute for L-enantiomers in function.

Amino acids are zwitterions, which are neutral species that can carry both a positive and a negative charge. In solution, the proton on the carboxylic acid end will generally dissociate and protonate the amine end, creating a negative charge on the newly formed carboxylate end and a positive charge on the amine end. This protonation and deprotonation is pH dependent, so the charges of amino acids can change over a pH range, one of the reasons why altering pH of a protein can affect the activity of the protein.

Different residue sidechains have different properties. One of the most important properties of an amino acid is its polarity, which is highly dependent on the sidechain. Polar functional groups (hydroxyls, amines, sulfhydryls) as well as charged functional groups increase the polarity of an amino acid, while the lack of these functional groups decreases polarity. Polarity is a major factor in determining hydrophobicity. Water is highly polar and will interact via intermolecular attractions with other polar molecules, so polar amino acids are hydrophilic. In contrast, water does not interact much with nonpolar molecules, so nonpolar amino acids are generally hydrophobic.

The charge of a functional group also affects the properties of an amino acid. For example, the red sidechains in the diagram are negatively charged, and the blue ones are positively charged. Charges are determined by which of the three forms of the amino acid is most prevalent at physiological pH. Recall the definition of a Brønsted-Lowry acid/base pair - acidic species donate protons to water to form hydronium whereas basic species accept protons from water to form hydroxide. Amino acids whose sidechains are negatively charged at neutral pH have donated a proton, so they are commonly referred to as "acidic". Similarly, amino acids whose sidechains are positively charged have accepted a proton and are known as "basic". These properties determine how the protein folds (i.e., the secondary and tertiary structure), because certain types of residues attract, repel or bond to other types of residues. Also, the types of residues present can determine how the protein interacts with other molecules such as DNA- for example, serine can form hydrogen bonds, and therefore is often found at binding sites in a protein.
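
The pH dependence of sidechain charge can be estimated with the Henderson-Hasselbalch relationship, in which the fraction of a group that is protonated is 1 / (1 + 10^(pH - pKa)). The sketch below uses approximate textbook pKa values, which are assumptions that vary between sources and shift inside a folded protein. Cysteine and tyrosine, which can also ionize, are left out for brevity.

```python
# Approximate sidechain pKa values (assumed textbook figures) and the charge
# each group carries when ionized: -1 for acidic, +1 for basic sidechains.
SIDECHAIN_PKA = {
    "ASP": (3.9, -1),
    "GLU": (4.3, -1),
    "HIS": (6.0, +1),
    "LYS": (10.5, +1),
    "ARG": (12.5, +1),
}


def sidechain_charge(residue, ph):
    """Average sidechain charge of a residue at the given pH."""
    if residue not in SIDECHAIN_PKA:
        return 0.0  # treated as uncharged in this simplified model
    pka, ionized_charge = SIDECHAIN_PKA[residue]
    protonated = 1 / (1 + 10 ** (ph - pka))  # Henderson-Hasselbalch
    if ionized_charge > 0:
        return protonated        # basic groups are charged while protonated
    return -(1 - protonated)     # acidic groups are charged once deprotonated


for name in ("ASP", "HIS", "LYS", "SER"):
    print(name, round(sidechain_charge(name, 7.4), 3))
```

With these values, aspartate comes out close to -1 and lysine close to +1 at pH 7.4, while histidine is only slightly positive because its pKa lies near neutral pH.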

Charged sidechains repel like charges and attract opposite charges. Hydrophilic sidechains usually end up on the outside of a folded structure, because most proteins fold in an aqueous environment and the polar sidechains interact well with water. For the same reason, hydrophobic sidechains usually end up on the inside of the structure, because they do not interact well with water. In addition, hydrophobic sidechains often interact with other hydrophobic sidechains in what is termed hydrophobic interaction.
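
One way to put numbers on "hydrophobic" and "hydrophilic" is a hydropathy scale. The sketch below uses the Kyte-Doolittle scale, in which positive values are more hydrophobic. It is only one of several published scales, and the scales disagree on borderline residues, so treat the output as a rough guide. The sample sequences are made up.

```python
# Kyte-Doolittle hydropathy values, keyed by one-letter code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(sequence):
    """Average hydropathy of a one-letter sequence (positive means hydrophobic)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence.upper()) / len(sequence)


# Made-up segments: one likely to be buried, one likely to be exposed to water.
print(mean_hydropathy("LIVF") > 0)  # True
print(mean_hydropathy("KDER") < 0)  # True
```

A stretch with a strongly positive average is a reasonable candidate for the protein interior or a membrane-spanning region, which can help when deciding which parts of a prebuild model to describe on the card.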

Cysteine, which is shown in green in the diagram, is a special amino acid. When two cysteine residues are oxidized, a disulfide bridge is formed between their side chains, forming a molecule named cystine. Disulfide bridges help many proteins maintain their structure by covalent bonding.
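
Disulfide bridges can be spotted from coordinates because the two sulfur atoms sit very close together. The sketch below flags pairs of cysteine sulfur (SG) atoms within an assumed cutoff of 2.5 Angstroms. A typical sulfur-sulfur bond is about 2.05 Angstroms, and both the cutoff and the coordinates are chosen only for illustration.

```python
from itertools import combinations
from math import dist  # available in Python 3.8 and later

SS_BOND_CUTOFF = 2.5  # Angstroms; assumed cutoff, slightly above a typical S-S bond


def find_disulfides(sg_coords):
    """Return residue-number pairs whose SG atoms are within the cutoff.

    `sg_coords` maps a cysteine residue number to the (x, y, z) position
    of its SG atom.
    """
    bridges = []
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sorted(sg_coords.items()), 2):
        if dist(xyz_a, xyz_b) <= SS_BOND_CUTOFF:
            bridges.append((res_a, res_b))
    return bridges


# Made-up coordinates: residues 14 and 38 are bonded, residue 52 is far away.
demo_sg = {14: (0.0, 0.0, 0.0), 38: (2.0, 0.3, 0.0), 52: (10.0, 4.0, 1.0)}
print(find_disulfides(demo_sg))  # [(14, 38)]
```

Cysteines that are far apart in the sequence can still be bonded if folding brings them together, which is why the check uses distance in space and not residue numbers.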

Each residue in a chain is given a number, starting at the amino terminus (that is, the end that has an amino group still present) with the lowest number (which is not always 1, depending on the numbering conventions for the particular family of proteins) and going up to the carboxy terminus (the end that has a carboxyl group still present).

Secondary Structure

Secondary structure is the first level of folding in a protein. Patterns called "motifs", such as alpha helices and beta sheets (by far the two most common), are caused by hydrogen bonding between the backbone amide (N-H) and carbonyl (C=O) groups of the residues, not by interactions between the sidechains.

An alpha helix (hydrogen bonds shown)
A beta sheet (direction of chain shown)

Alpha helices are slightly more common in proteins overall than beta sheets. These helices are tightly coiled single strands, kept in place by hydrogen bonds between nearby residues. They can be anywhere from only a few residues in length to over 100 Angstroms in some proteins. They tend to be the base of protein "stalks" (such as that of 2009-10's influenza hemagglutinin).

Beta sheets, on the other hand, are made up of many beta strands- kinked sequences of residues separated by loops. These strands line up side by side, either parallel or antiparallel; in an antiparallel sheet, adjacent strands point in opposite directions (direction matters, remember, because of the numbering of residues from the amino terminus to the carboxy terminus). Multiple hydrogen bonds form between adjacent strands. Beta sheets are very strong as protective or support layers (such as the "beta-barrel" exterior of GFP).

Tertiary Structure

Tertiary structure is the position in three dimensions of the secondary structures (motifs). It is determined by the secondary structures present, as well as the properties of the sidechains. Hydrophilic sidechains such as glutamine will move to the "outside" when the protein is folded in a watery environment, while hydrophobic sidechains such as tryptophan will cluster "inside" the protein, protected by other sections of the protein, to prevent their exposure to water. Oppositely charged sidechains come together, forming salt bridges (ionic bonds), while sidechains with the same charge repel each other. Cysteine, which contains sulfur, bonds covalently with other cysteines to form strong disulfide bonds. The interaction of all these attractions and repulsions causes the protein to develop a unique shape in 3D, called a "conformation".

The protein's tertiary structure also depends on the environment in which it is folded: in the human body, which is a watery environment, the hydrophobic (nonpolar) sidechains end up on the inside, as stated above. However, in a protein folded in a hydrophobic environment (such as a protein embedded in a phospholipid cell membrane), the hydrophilic (polar) sidechains end up on the inside.

Quaternary Structure

Quaternary structure is the arrangement of each of the individual pieces (monomers) of a multi-unit (multimeric) protein. These subunits, or "chains" as they are often called, each have their own amino and carboxy terminus, and are not physically attached to each other. However, they are held together by bonds- which can be disulfide or ionic, although more commonly the latter- and arranged together in a specific conformation. Multimers are quite common, and may contain several distinct chains or simply several copies of the same one (or few).

One chain shown in blue; the other shown in green.

Using Jmol

Jmol is the software used to create computer models of the proteins for the Protein Modeling event.

Downloading Jmol and PDB Files

Jmol is open-source and can be downloaded here or used online. This is the pre-build online Jmol environment, which offers the same features as the online Jmol environment used for the on-site exploration. However, it only allows competitors to view the provided pre-build file. As a result, downloading Jmol to a personal computer is still advised. While it is important for competitors to familiarize themselves with the online interface, downloading Jmol allows competitors to explore any PDB file they want.

To download Jmol, go to the download page linked to above. This link provides more specific instructions and the following information is simply a summary. Please note that using Jmol on a computer requires that computer to have Java installed - see Oracle's website for further instructions.

Click the direct download link at the top of the page. This should lead to a page on that will automatically download the current version of Jmol. At the time of writing, this file should be named "". Extract the zip folder, which should contain a variety of different files and folders. The most important one is Jmol.jar. Place this file in a folder such as C:\Users\Documents\Jmol or something similar. Avoid including spaces in the name of the folder - Jmol may have issues saving and retrieving files if there are spaces in the file path. Any other files in the initial .zip folder can be deleted, as the only necessary one is Jmol.jar. Any PDB files or scripts should also be saved to this Jmol folder, as Jmol may have difficulties opening files that are not in the same folder as the main program.

[Any competitors using Linux who are familiar with the command line can also download the .tar.gz file, use "tar -xzvf" to extract, and then take Jmol.jar from the resulting directory.]

One of the main benefits to downloading Jmol is having the ability to open any desired PDB file. In order to view a desired protein, the PDB file for that protein should be present on the computer with Jmol installed. To download a PDB file, go to its structure summary page (linked to in the Year-Specific Information section) and click the "Download Files" option in the upper-right corner. A drop-down menu will appear. Click on PDB File (Text), and save the file in the same folder as Jmol.jar.


There are two ways to enter commands in Jmol- through the command prompt or by selecting options from the drop-down menus. Although people who have never done any programming or used a command line setup before might be more comfortable with the menus, the command prompt is much easier and more versatile. The command prompt is also the only way to use the online Jmol environment. Since every competitor must use the online environment in competition, it is necessary to know how to use the command line. To navigate to the command prompt on the downloaded version of Jmol, go to the File menu and click "Output Console". Some useful commands are listed in the section below.

List of Commands

These are some of the most common commands used while modeling a protein in Jmol:

select (object)
Selects the specified objects, making them the subject of future commands (i.e., "select *a" would select chain A of the protein, since * is the symbol for chain).
restrict (object)
Selects the specified objects and makes everything else disappear (i.e., "restrict *a" would leave a screen showing only chain A, which would also be selected).
center (object)
Centers the field of view on the specified objects. This is an incredibly useful command after a "restrict", because otherwise, the protein will easily swing out of the field of view when it is rotated.
undo
Undoes the last command.
redo
Redoes the last undone command.
write [scriptname].spt
Saves the current display state as a file called "[scriptname].spt" in the same folder where Jmol is. This can be useful for creating a file that shows all the sidechains that are being modeled for the prebuild, for example. It can then be loaded by navigating to File -> Open and selecting the script.

Objects that can be selected include:

chains
As mentioned above, the symbol for a chain is *. It should work as *[letter] or * [letter] (i.e., with or without the space).
specific residues
Residues can be specified using their residue numbers- individual or a range- or three-letter codes. For example, "select 41" will select residue 41, "select 41-60" will select all residues from 41 to 60 inclusive, and "select his" will select ALL histidines. If selecting all histidines between residues 41 and 60 only, it is possible to string together ranges and residue types using Boolean operators (see below).
DNA
DNA works the same way as proteins do in Jmol, in that specific parts like the backbone or nucleotides can be selected. However, the entire molecule can be selected as a whole with "select dna".
non-protein stuff
Everything that isn't protein (or DNA) is "hetero", as in "select hetero". This includes any ligands (such as ions or small molecules) associated with the protein, as well as water (which has its own designation as well- "hoh"). Because of the method by which protein structure data is obtained, all the surrounding water molecules are also captured. They can be quite distracting, so removing or not displaying the water might be necessary ("select hetero and not hoh"- see below for more information about Boolean operators).
sidechains
When viewing a given residue, it may not be optimal to view the entire residue including the amino and carboxy groups (or what's left of them); in most cases, a competitor would only want to see the sidechain sticking out from the alpha-carbon backbone. Use "select sidechain" (which will select all sidechains of all residues) or "select [residue] and sidechain" (see below for more information about Boolean operators) to select just the sidechain of a particular residue.
alpha helices or beta sheets
When folding the Mini-Toober model, it can be helpful to see where the helices and sheets are. Alpha helices can be selected with "select helix" and the beta sheets can be selected with "select sheets".

Once something is selected, the way it is displayed can be altered in various ways.

backbone [number]
Displays the selection in backbone mode- where only the alpha carbons of the amino acid residues are shown- with the number being the radius. This is the default display mode -- the Mini-Toober represents this part of the protein. A radius of about 300 works well.
wireframe [number]
Displays the selection in wireframe mode- where the connections between atoms (not just alpha carbons) are shown as lines/cylinders- with the number being the radius of the cylinders. This display mode is the most common way of viewing sidechains; a radius of about 200 works well. To remove wireframe from the selection, use "wireframe off".
spacefill [number]
Displays all atoms in the selection as spheres, the number being the radius of the spheres. If no number is included, Jmol will show the full Van der Waals radius of the atoms. This is another way of viewing sidechains or, more commonly, ligands such as ions. For ions, typically no number is needed (the Van der Waals radius works just fine), but for sidechains it may be advantageous to make it slightly larger than the wireframe (225-250). To remove spacefill from the selection, use "spacefill off".
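
Commands like these are usually typed one at a time, but they can also be collected into a script file. The sketch below is a hypothetical Python helper that assembles a short Jmol script from the commands described above, using the chain and residue syntax exactly as this page gives it. The function name and the suggested file name are illustrative assumptions, and the radii are the values recommended above.

```python
def build_view_script(chain, residue):
    """Assemble Jmol commands that show one chain and highlight one residue type."""
    commands = [
        f"restrict *{chain}",                           # show and select only this chain
        f"center *{chain}",                             # keep it in view while rotating
        "backbone 300",                                 # alpha-carbon trace for the chain
        f"select *{chain} and {residue} and sidechain", # sidechains to highlight
        "wireframe 200",                                # draw them as sticks
        "color cpk",                                    # color them by element
    ]
    return "\n".join(commands) + "\n"


# Example: chain A with histidine sidechains highlighted.
print(build_view_script("a", "his"))
```

Saving the printed text as a file such as prebuild_view.spt in the Jmol folder would allow it to be opened through File -> Open in the same way as a script written from within Jmol.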

Coloring Modes

color cpk
Colors the selection according to the "CPK" color system, which has a specific color for each element. Carbon is gray, oxygen is red, nitrogen is blue, and sulfur is yellow (these are the most common colors seen). Note: if the entire protein is in "backbone" mode, the entire protein will appear gray, because only the alpha-carbon backbone is being displayed.
color [R,G,B]
Colors the currently selected objects the RGB values specified (include numbers in place of R, G, and B to specify the color). There's also the easy way: instead of putting in [R,G,B], just put in a color name. Jmol only knows some of them, though (for instance, it knows "yellow", but probably doesn't know "tangerine". It might, though, since it has a pretty extensive- and somewhat bizarre- color library).
color structure
Colors the alpha-helices (pinkish-red) and beta pleated sheets (yellow). DNA is purple under this coloring system, and everything that has no specific secondary structure is white. Depending on the version of Jmol used, 3/10 helices may be the same color as alpha-helices, or may be a more purplish color.
color group
Colors the protein to display the amino and carboxy termini. The amino terminus is colored blue, and the carboxy terminus is red. Everything in between is displayed in rainbow gradations, so the part shown in green is closer to the amino terminus, while the part shown in orange is closer to the carboxy terminus.
color chain
Assigns a different color to each chain of the protein. This is useful mainly for determining how many chains are present in a given PDB file, or where one chain ends and another begins if they are closely associated.
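
The blue-to-red gradient described under "color group" can be approximated by sweeping the hue from blue to red along the chain. The sketch below is an illustration of that idea only. Jmol's actual palette may differ in its intermediate colors, and the function name is hypothetical.

```python
import colorsys


def group_color(index, total):
    """Approximate rainbow RGB color for residue position `index` (0-based) of `total`."""
    fraction = index / (total - 1) if total > 1 else 0.0
    hue = 240 * (1 - fraction) / 360  # 240 degrees is blue, 0 degrees is red
    red, green, blue = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return tuple(round(255 * channel) for channel in (red, green, blue))


# A 101-residue chain: amino terminus, midpoint, carboxy terminus.
print(group_color(0, 101))    # (0, 0, 255)
print(group_color(50, 101))   # (0, 255, 0)
print(group_color(100, 101))  # (255, 0, 0)
```

Reading the gradient in reverse is the useful skill: a segment's color shows roughly where it falls between the two termini, which helps when tracing the chain to fold the Mini-Toober.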

Boolean Operators

Boolean operators are connecting words like "and" and "or", which determine how Jmol combines the different groups it is supposed to be acting on.

When using "and" to connect two things, Jmol will act on only things that fit both criteria. For example, "select his and sidechain" selects only the sidechains of all the histidine residues. "And"s can also be strung together, and Jmol will act only on things that fit all the criteria: "restrict *a and 1-40 and cys" selects only cysteine residues in the interval from 1-40 of chain A.
When using "or" to connect two things, Jmol will act on anything that fits either of the criteria. For example, "select cys or his" will select all cysteine and all histidine residues. Note: because of the way that boolean operators function in Jmol, it may be better to use "or" instead of "and" in certain scenarios. To select both cysteines AND histidines, type "select cys or his". "select cys and his" does nothing, because there are no residues that are both cysteine and histidine.
When using "not", Jmol will act on everything that doesn't fit the given criteria. For example, "select not dna" selects everything in the file that isn't DNA. It can be combined with other operators, like "select hetero and not hoh", which selects all non-protein, non-DNA matter that also isn't water.
When using "xor" to connect two things, Jmol will act on anything that fits only one of the criteria.

Combining Operators

Parentheses can be used to combine different Boolean operators in different ways. If all operators are the same (e.g., "*a and 1-40 and his" or "his or cys or gly"), no parentheses are required; however, if there is a combination of "and"s and "or"s, or if a "not" should apply to some combination of criteria, parentheses are necessary to tell Jmol which operators should be grouped together.

For example, "select (*a or *b) and his" will select all histidines in chain A and all histidines in chain B, whereas "select *a or (*b and his)" will select ALL of chain A and all histidines in chain B. If no parentheses are included, "and" takes precedence over "or", so Jmol will assume the latter case.

It can also be confusing figuring out what the "not" applies to. If there are no parentheses present, it modifies only the object immediately following it (i.e., "select not *a or his" selects everything that isn't part of chain A, as well as all histidines, including those in chain A- one of the criteria is "not *a" and one is "his", and everything that fits at least one or the other gets selected). To have the "not" apply to more than one criterion, use parentheses after the not; for example "select not (*a or his)" will select everything except chain A that also isn't a histidine. It has the same effect as "select not *a and not his".
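
These rules behave like ordinary set operations, which can make them easier to reason about. The sketch below models a tiny, made-up structure with Python sets, where "and" corresponds to intersection (&), "or" to union (|), and "not" to the difference from everything. It is an analogy for the selection logic described above, not a model of how Jmol works internally.

```python
# Hypothetical residues, each written as (chain, residue number, residue type).
residues = {
    ("A", 1, "HIS"), ("A", 2, "CYS"), ("A", 3, "GLY"),
    ("B", 1, "HIS"), ("B", 2, "GLY"),
}
chain_a = {r for r in residues if r[0] == "A"}
chain_b = {r for r in residues if r[0] == "B"}
his = {r for r in residues if r[2] == "HIS"}
cys = {r for r in residues if r[2] == "CYS"}

# "select cys and his": no residue is both, so nothing is selected.
print(cys & his)  # set()

# "select (*a or *b) and his": histidines from both chains.
both = (chain_a | chain_b) & his

# "select *a or (*b and his)": all of chain A plus the histidines of chain B.
# Python's & binds tighter than |, so the parentheses are optional here too.
mixed = chain_a | chain_b & his

# "select not (*a or his)" matches "select not *a and not his".
not_either = residues - (chain_a | his)
same = (residues - chain_a) & (residues - his)
print(not_either == same)  # True
```

In this example `both` holds two residues, `mixed` holds four, and `not_either` holds only the glycine in chain B.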

Folding the Mini-Toober Model

At competitions, for the on-site models, a kit will be provided consisting of Mini-Toobers, end caps, cross-linkers, and foam sidechains for the important residues that should be highlighted. For the pre-build, it is possible to buy a similar kit (minus the foam sidechains, as competitors must determine what to highlight). Alternately, the model can be made out of any other similar material (12-gauge dimensional house wire is given as an example in the rules, but anything that is sufficiently flexible but will hold its shape will suffice).

During competition and in the pre-build the Toober, end caps, and any sidechains that are given are all required parts of the model – their placement and presence count toward the final score.

The cross-linkers, however, are for stability purposes only, and do not have to represent any particular structure. Competitors do not have to use any or all of them, although during competition, it makes sense to use all the ones provided just to make the model as structurally sound as possible (the judges will be handling it frequently, and it may distort in the middle of grading if struts are not used). For the pre-build, they can just be used for stability (like they would be for an on-site build), they can represent particular bonds (as long as this is specified on the card), or the model can be stabilized in some other way.

The Mini-Toober model will be based on the Jmol model, but unfortunately, the only way to convert the latter into the former is to use the Jmol image as a sort of "map" showing how to fold the toober.

Fortunately, CBM has an excellent step-by-step video tutorial of how to fold the Mini-Toober based on what is displayed in Jmol. This should answer any questions about creating the Mini-Toober model.

Useful Links

Mini-Toober folding tutorial (mentioned above)
Protein Data Bank
Zinc fingers Molecule of the Month article
General protein info
404ic's Protein Modeling Guide
Celestite's 2016 SSSS Notes
Jmol Quick Reference Sheet