Codon Optimisation: A Hidden Process Behind IndieBB

I've posted twice recently after a prolonged blogging absence, and here I am again. Perhaps I should always be running a crowdfunding campaign, so that I have a stake in blogging; I'd be far more prolific! I'm too principled (or, to some, “dogmatic”) to include advertising on this or any blog, and flattr revenue is far too thin to encourage more than the occasional post otherwise.

In any case, today I'd like to share something with you all that I've been meaning to write up for ages anyway, but which becomes especially relevant in light of my IndieBB DIYbio/Biohacking/Teaching plasmid design project.

This is the story of how I was faced with a synthetic biology design problem for which there were few acceptable existing solutions, and how it led me to design my own software for gene design. This software is being used to design IndieBB, so you might have a stake in it already!

Reminder: Like the preceding post, this post gets into the technical details of DNA design. It is not necessary that IndieBB users understand any of this; the kit is designed for total beginners!

Green Fluorescent Protein: Little Green Cells

One of the included genes in IndieBB is the “Green Fluorescent Protein” (GFP) from the jellyfish A.victoria, a widely used biotechnology tool which, when expressed within a cell in sufficient quantity, makes a cell visibly fluorescent. So, colonies of E.coli will be green under a full-spectrum or ultra-violet (UV-A) strip-lamp, allowing you to see you've succeeded in making engineered cells.

When the gene for GFP (which, as a protein, had been known long before transgenics emerged) was first discovered, it was a great day for biotech. Finally, a protein that doesn't need to be extracted from cells to detect; you can just illuminate the cells, and notice that they're more fluorescent than before! Things got even better when, several years later, an enhanced variant of GFP called “eGFP” was discovered; this was several times more fluorescent and, in modest quantity, was clearly visible by eye. It also had a better excitation spectrum (the range of light frequencies which can make it fluoresce), meaning handheld UV-LEDs could be used instead of UV-striplamps.

Green Fluorescent Protein expressed in transgenic cells
(Image credits: Wikipedia)

Sadly, while GFP is out of patent by now (it was ludicrous in the first place that a patent could be obtained on a natural gene), eGFP is still burdened by several patents which will not expire for a few years. This means any reasonable hope of making an open source product on eGFP is dead in the water until those patents die, and in the mean-time we're stuck with GFP.

“Natural” or “wild type” GFP, being significantly less fluorescent than eGFP, needs to be expressed (produced) in cells in much greater quantity, and to achieve this in synthetic biology we do one or both of two things: place the gene under the control of a strong promoter, and codon optimise the gene. The first is a matter of copy and paste, but the latter is far more involved and murky.

Codon Optimisation

If you're well-read on DNA, you may know that a “gene” is a section of DNA that instructs a cell how to, and under what circumstances to, produce a protein. The protein then goes on to be or do something useful for the cell.

The protein itself is coded for by a sub-set of the gene called the “Coding Sequence”, usually shortened to “CDS”. This happens through a process called “transcription”, where the DNA, starting near the “promoter” region and ending near the “terminator” region is copied to a molecule of RNA. This RNA transcript is then translated according to the RNA code by a protein:RNA complex called a ribosome. Translation uses triplets (codons) of RNA nucleotides to assemble single amino acids into a chain of amino acids, which is called a “protein”.

The ribosome, and the RNA code, are two of the strongest indicators that all known life on earth evolved from a common origin; they are both highly similar across all known species. The RNA code varies between groups of life usually in only one or two places at a time, and even then many of the differences are circumstantial.

However, while the language remains the same, the dialect differs widely. While jellyfish like A.victoria and bacteria like E.coli broadly agree on which three-nucleotide “words” (codons) code for which amino acids, they differ greatly in their relative usage of words which may code for the same amino acid. The RNA code is redundant, meaning that for some amino acids there are many different codons which are used, but some species don't use some codons at all, whereas other species might use those same codons exclusively. Trying to copy a gene between distantly related species may fail simply because the gene contains “words” which the target species no longer understands, or doesn't like to use under normal circumstances.
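
To make the redundancy concrete, here is a small Python sketch that builds the standard genetic code (written with DNA letters, so T rather than U) and groups the codons by the amino acid they encode. The names CODON_TABLE, SYNONYMS and synonymous_codons are made up for this illustration; only the table itself is standard.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def synonymous_codons(table):
    """Group codons by the amino acid (or stop signal) they encode."""
    groups = {}
    for codon, aa in table.items():
        groups.setdefault(aa, []).append(codon)
    return groups

SYNONYMS = synonymous_codons(CODON_TABLE)

print(SYNONYMS["L"])  # leucine has six codons to choose from
print(SYNONYMS["W"])  # tryptophan has only one: ['TGG']
```

Every amino acid with more than one entry here is a place where two species can "spell" the same protein differently.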

Enter codon optimisation, one of synthetic biology's darker arts. In principle, if the relative usage of codons can make a protein perform worse in one species than another, then the reverse is also true. Using alternative codons to encode each amino acid of a CDS can make the cell make the protein in higher quantities, or while depleting fewer scarce codons that might be needed for other genes.

As a bonus, while you are re-writing the CDS of a gene anyway, you can specify extraneous things you'd like to do at the same time. Perhaps you want to purge your CDS of target sites for a particular (or many) enzyme, or you'd like to prevent any sequences being created which could form strong secondary structures (areas where DNA folds up into origami) that might interfere with the translation of the RNA code. Let's call all of this stuff-you-don't-want “excluded sequences” or “excludes”.

But from that simple principle there is a blossoming of complexity. What's the best way to optimise genes? At first, people assumed a “best pick” method would work well; analyse the target species, identify the codons that species uses most often, and then always use that codon unless it would create some excluded sequence, in which case you fall back to the second-best, and so on.
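
For the programmers reading, best pick can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of any real tool: the function name best_pick is invented, and the usage numbers in CODON_USAGE are hypothetical and cover just three amino acids.

```python
# Hypothetical usage counts for three amino acids (illustrative, not measured data).
CODON_USAGE = {
    "L": {"CTG": 50, "TTA": 14, "TTG": 13, "CTT": 11, "CTC": 11, "CTA": 4},
    "K": {"AAA": 34, "AAG": 10},
    "E": {"GAA": 40, "GAG": 18},
}

def best_pick(protein, usage, excludes=()):
    """Greedy 'best pick': for each amino acid, take the most-used codon
    that does not create an excluded sequence in the growing CDS."""
    cds = ""
    for aa in protein:
        # Rank this amino acid's codons from most to least used.
        ranked = sorted(usage[aa], key=usage[aa].get, reverse=True)
        for codon in ranked:
            candidate = cds + codon
            # A newly created site must overlap the new codon, so only
            # the last len(ex) + 2 letters need checking.
            if not any(ex in candidate[-(len(ex) + 2):] for ex in excludes):
                cds = candidate
                break
        else:
            raise ValueError(f"no codon for {aa!r} avoids the excludes")
    return cds

print(best_pick("LKE", CODON_USAGE))                     # CTGAAAGAA
print(best_pick("LKE", CODON_USAGE, excludes=["GAAA"]))  # CTGAAGGAA
```

In the second call the favourite lysine codon AAA would create the excluded "GAAA" across the codon boundary, so the sketch falls back to AAG.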

Best pick is better than nothing in cases where a gene otherwise won't work at all. But, the weight of evidence appears to suggest that it doesn't usually improve gene expression when the gene already works acceptably, on average. Some genes improve, others actually get worse.

An alternative method came into favour which attempted instead to match the relative codon usage of the gene with the relative codon usage of the target cell. So, if upon inspection a species likes to use the four available codons for a particular amino acid in a 4:3:2:1 ratio, then you try to match that ratio in your target gene.
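
In code, the simplest way to approximate this is to pick each codon at random, weighted by its usage. The Python sketch below assumes a made-up 4:3:2:1 table for the four alanine codons; the name sample_by_usage and the numbers are invented for illustration and are not taken from any particular tool or organism.

```python
import random

# Hypothetical 4:3:2:1 usage for the four alanine codons (not measured data).
USAGE = {"A": {"GCG": 4, "GCC": 3, "GCA": 2, "GCT": 1}}

def sample_by_usage(protein, usage, rng=None):
    """Pick each codon at random, weighted by its relative usage in the target."""
    rng = rng or random.Random()
    cds = []
    for aa in protein:
        codons = list(usage[aa])
        weights = [usage[aa][c] for c in codons]
        cds.append(rng.choices(codons, weights=weights, k=1)[0])
    return "".join(cds)

# Over a long run of alanines the picks approach the 4:3:2:1 ratio.
cds = sample_by_usage("A" * 10000, USAGE, rng=random.Random(0))
picks = [cds[i:i + 3] for i in range(0, len(cds), 3)]
for codon in USAGE["A"]:
    print(codon, picks.count(codon) / len(picks))
```

For a short gene the realised ratio will wander from the target, which is one reason a tool might generate many candidates and keep the best.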

This system actually works better, but it's still not perfect (though nothing ever will be). But, better refinements weren't long coming, and there are now several ways of optimising that function, basically, on a random allotment of codons according to a desired final frequency. Some prefer to choose codons according to their relative usage in a subset of cellular proteins believed to be very important, or highly expressed. In at least one case an empirical study determined the best codons to use by mutating many, many copies of some “test” genes and assessing which codons were associated with high expression, though this is expensive and time consuming, and has only been done for E.coli so far.

But at this point, the method remains the same, with the precise details being left to a codon usage table (CUT). Using this method, you specify your gene, and your list of excludes, and your CUT, and your program hopefully churns out a better copy of the gene for the target species, ready to be ordered from a gene synthesis company.

Things seemed fairly stable for a while, until the whole thing got murkier again. First, it became apparent that, once codon scarcity was removed as the primary cause of expression problems, the next most common problem with designed genes was the presence of secondary structures in the CDS region of the gene: the aforementioned “DNA/RNA origami” that can cause RNA templates to fold up, concealing their beginning from eager ribosomes. It would later be demonstrated that, generally speaking, this was most true of the “leader” region where the ribosomes bind and pick up momentum, and less true of the later regions. This is actually (apparently) because the ribosomes themselves, once bound, help to flatten out the RNA and prevent structures from forming.

Once the structure issue was better documented, some bright spark, noticing a few details in “natural” codon usage in wild genes and some contradictory details in custom-made genes, asked whether the speed with which a ribosome translates an RNA template can at times be too fast, the opposite of what everyone else assumed. The assumption made was that, if ribosomes all start at high velocity as soon as they find and stick to an RNA template, then they'll be thinly spaced as they all progress along the template. If the lead ribosome pauses for some reason; say, it finds a codon that's used rarely, or the amino acid it needs is suddenly not available when required, then you could get a Ribosome pile-up!

Studies into this actually bore out the idea that ribosomal collisions could occur, and could lead to problems expressing a protein that was, according to the best available knowledge of the day, “highly optimised”. This has become known as the “on-ramp” hypothesis; for good outcomes, the early portion of an RNA template should actually be less optimal than the rest, so that ribosomes start out slow before speeding up later, allowing them to space out slightly and preventing collisions. This appears to be what most wild genes with high transcription rates do, and it appears to be beneficial for designed genes, too.

With me so far? So, the state-of-the-art in gene design today looks like this:

  1. Pick codons according to a frequency table, in preferential order of empirically-determined-to-be-best, used-in-highly-expressed-genes, or overall-frequency-in-the-target-organism's-genome.
  2. Try to avoid secondary structures in the region that becomes the RNA transcript, particularly in the initial portion.
  3. Try to make the initial portion of the RNA transcript be less optimal than the rest, to create an on-ramp for ribosomes.
  4. Do all of this while avoiding a list of sequences that you've determined, for structural, efficiency or convenience reasons, to be better off excluded from the final sequence.

Got that? Now the problem; nothing like this existed.

Codon Optimisation Software Woes

When I began designing genes, I used a tool called jcat, which had a handy web-interface, to design some genes for expression in B.subtilis (readers of prior posts will recall that I was attempting something like IndieBB in that species back then, and that I've promised to relate the full story during the crowdfunding campaign).

jcat is a best-pick tool; it prefers to use the most-commonly-used codon for the target species whenever possible. It could exclude sequences from the final design, but seemingly only by specific sequence, rather than using the more useful generic IUPAC-notation (where an expanded set of letters allow you to specify “A or T” or “not G” instead of having to use only A, C, T or G).

I also made use of the gene optimisation system provided as a free service for customers (more on this trend later) by Mr Gene, a since defunct (I think) gene synthesis company, which also appeared to make use of the best-pick system for design. It, too, would allow specific sequences to be excluded from the target design, but not IUPAC patterns.

As I became aware that best-pick was being discredited, I started looking around for more up-to-date systems, but I found that all the newer available tools I could find that specified which system they used (best pick or CUT-based) were proprietary, and had very specific terms of use; most were provided by gene synthesis companies who insisted that you buy the resulting gene from them! My more antiauthoritarian users may be like me in balking at this, and may further suggest “screw that, design and buy elsewhere anyway”. But, dear reader, even if I were scurrilous enough to do so (:)), it's possible that a proprietary program could be embedding a “watermark” in the resulting DNA which is provably linked to the system, and could later result in a burden of intellectual property should the resulting DNA become useful or wildly successful. Also, I'm a purist; why didn't good software exist for codon optimisation?

So, I wrote my own CUT-based codon optimisation system, and as a bonus I included code that would not only permit exact sequence exclude-lists, but also extended IUPAC-notation exclude-lists. So instead of specifying three separate excludes ["TAG", "CAG", "GAG"], you could just say “BAG”, meaning [not-A]AG. I called this Python codon optimisation system PySplicer, because it was written in Python and it eventually made use of a splicing system to rapidly triage a good-ish gene from hundreds of initial candidates, prior to more involved gene editing to resolve excluded sequences.
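
To show how little it takes to support IUPAC excludes, here is a minimal Python sketch that turns a pattern like "BAG" into a regular expression and reports every hit. It is an illustration only and not PySplicer's actual code; iupac_to_regex and find_excludes are names invented here, while the letter meanings are the standard IUPAC nucleotide codes.

```python
import re

# Standard IUPAC nucleotide codes, mapped to the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(pattern):
    """Compile an IUPAC pattern into a regex; the lookahead lets
    overlapping sites be found as well."""
    body = "".join(f"[{IUPAC[ch]}]" for ch in pattern.upper())
    return re.compile(f"(?=({body}))")

def find_excludes(seq, patterns):
    """Return (pattern, position, matched sequence) for every hit in seq."""
    hits = []
    for pattern in patterns:
        for match in iupac_to_regex(pattern).finditer(seq.upper()):
            hits.append((pattern, match.start(), match.group(1)))
    return hits

print(find_excludes("ATGTAGCCAAGGAG", ["BAG"]))
# [('BAG', 3, 'TAG'), ('BAG', 11, 'GAG')]
```

Note that the "AAG" at position 8 is correctly ignored, since B means anything but A. This sketch only scans the strand it is given; a real tool would probably want to check the reverse complement too.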

The first PySplicer was a raw CUT-based design tool with exclude lists, written in the most awful, incomprehensible sort of code. It was one huge script, with one huge Python Class containing tens of methods which each handled a small part of the job, and cross-called one another, sometimes recursively. It was impossible to understand unless you'd written it, and indeed several months later, when I sat down to rewrite the whole thing, I was challenged to understand my own thought process, too! But, at this point I'd fulfilled parts 1 and 4 from the above list of “current best practice in CDS design”.

When I later wanted to rewrite and improve PySplicer, I had by then learned of the on-ramp hypothesis and the importance of DNA/RNA structure to gene expression. I wanted to create the best available gene design software, and that meant implementing all four of the above features. Also, I wanted the code to be not-terrible, so others could see how it worked and help improve it as we learned more about gene design in future. I had been digging into DNA structure analysis tools written in a now-rare language called FORTRAN, and found them so difficult to read and understand that I felt pity for anyone in my position someday!

Rewriting PySplicer to fill out the list above is a story in computer coding, and I won't burden the reader, presumably a biology and not programming enthusiast (but possibly both), with the full details.

Suffice to say, fulfilling problem #2, structure analysis, required that PySplicer forfeit being a “pure python” program, because structure analysis is considered one of the harder problems in bioinformatics and requires highly optimised computer code. Not to mention that re-writing it in Python would have required half a year's work! So instead PySplicer requires that you install a well-regarded and current package written in C called “ViennaRNA”, and then hands off that part of the gene design process to ViennaRNA.

Problem #3 was easier; merely altering the codon usage tables used for the initial portion of the gene suffices to reduce the overall level of optimisation, providing the on-ramp.
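
One way such a table change could look, sketched in Python: blend the codon weights toward an even split for the first stretch of the gene, so rarer codons turn up more often there. Everything specific in this sketch is an assumption made for illustration (the blending formula, the 30-codon ramp length, the strength value, the toy usage numbers and the function names); it is not a description of how PySplicer actually builds its on-ramp.

```python
import random

# Hypothetical usage weights for two amino acids (not measured data).
USAGE = {"K": {"AAA": 3, "AAG": 1}, "E": {"GAA": 7, "GAG": 3}}

def flatten_usage(usage, strength=0.5):
    """Blend each amino acid's codon frequencies toward uniform.
    strength=0 keeps the original ratios; strength=1 makes synonyms equally likely."""
    flat = {}
    for aa, codons in usage.items():
        total = sum(codons.values())
        uniform = 1 / len(codons)
        flat[aa] = {
            codon: (1 - strength) * weight / total + strength * uniform
            for codon, weight in codons.items()
        }
    return flat

def design_with_onramp(protein, usage, ramp_codons=30, strength=0.5, rng=None):
    """Sample codons by usage, using the flattened table for the first
    ramp_codons positions and the normal table afterwards."""
    rng = rng or random.Random()
    ramp_table = flatten_usage(usage, strength)
    cds = []
    for position, aa in enumerate(protein):
        table = ramp_table if position < ramp_codons else usage
        codons = list(table[aa])
        weights = [table[aa][c] for c in codons]
        cds.append(rng.choices(codons, weights=weights, k=1)[0])
    return "".join(cds)

print(flatten_usage(USAGE, strength=0.5)["K"])  # {'AAA': 0.625, 'AAG': 0.375}
print(design_with_onramp("KE" * 40, USAGE, rng=random.Random(1)))
```

With a strength of 0.5 the 3:1 preference for AAA softens to 5:3 within the ramp, which is "less optimal" in exactly the sense the hypothesis asks for.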

Testing PySplicer, Designing a Free/Libre GFP

Of course, programming a codon optimisation tool (by now, it ought to be properly considered a “CDS” optimisation tool I suppose) is meaningless if it's not used to design genes for testing or use. I follow the philosophy that one should eat one's own dog-food; I use the tools I propose to others. So, I used PySplicer in a project which involved Green Fluorescent Protein, which I introduced in the above section.

The results were very favourable. While wild-type GFP was often lamented for its poor fluorescence in the available scientific literature, often to the point that it was impossible to see and hard to distinguish, my optimised wild-type GFP was clearly and satisfyingly fluorescent to the naked eye under a small blacklight, even in a brightly lit room and without any filters. This is the green fluorescent protein shown in the crowdfunding video, in fact:

A comparison of E.coli cells bearing PySplicer optimised wtGFP under incandescent and UV illumination. Cells are already somewhat aged, with reduced fluorescence overall and particularly in regions of dense growth.

So, if you've followed me this far, from start to finish, I hope you'll have a sense of the amount of “invisible” effort that has gone into IndieBB already. At the end of the day, IndieBB (if successful in raising funds) will be a black-box beginner's project, requiring no knowledge of its working in order for you and others to use it and learn from it. But to provide that level of ease has required a lot of effort, time and experience from me. And along the way, I've already been creating and releasing valuable tools for other synthetic biologists (and perhaps you) to use.
