EuGene is gene optimisation software. It automatically retrieves information from biological databases and uses expert algorithms to redesign genes for heterologous expression.

Quick Note: Please do not confuse EuGene with EuGène, the software from the Toulouse INRA labs for finding genes in eukaryotic and prokaryotic organisms using machine learning techniques.
Quick Note: A new version is available. Please download it from the download section.
What?

EuGene is synthetic gene redesign software.

It allows opening and parsing genome files, analysing gene and protein sequences, and redesigning them according to several biological factors such as codon usage, codon context, GC content, hidden stop codons, repeated codons or nucleotides, harmonisation, deleterious sequences, etc.

EuGene facilitates common tasks related to gene characterisation and manipulation, such as identifying genes and genomes, predicting protein secondary structure, showing protein tertiary structure, performing alignments with orthologs, calculating measures such as CAI and Nc, etc.

Also, as an auto-updating application, EuGene has to be downloaded only once; all updates from there on are automatic.

How?

Numerous online biological databases supply the means to search and retrieve their data. EuGene takes advantage of these resources to obtain the maximum amount of information about a gene.

Additionally, other tools are used offline to perform complex calculations, such as performing alignments and predicting protein secondary structures. This is done automatically and seamlessly for the user, and results are processed before being shown. Moreover, using genetic and other algorithms, EuGene is able to perform gene optimisation for expression in heterologous hosts.

All the methods and algorithms have been thoroughly studied and explored in order to obtain the best results and performance.

Acknowledgments - This project is supported by...

...the European project Mephitis, whose goal is to "significantly advance our understanding of protein synthesis in Plasmodium and to contribute to the effort on drug discovery against malaria".

...the European project Gen2Phen, which aims "to unify human and model organism genetic variation databases towards increasingly holistic views into Genotype-To-Phenotype (G2P) data, and to link this system into other biomedical knowledge sources (...)".

Coordination

José Luís Oliveira

Group leader and coordinator

Drives the project direction and coordinates the whole team.

Development

Paulo Gaspar

Team leader and developer

Leads the development of the application, and overall software engineering. Makes the main decisions related to future features and usability issues.

Contact me for questions and suggestions

Eduardo Sousa

Developer

The current main developer of the software. Participates in the planning, and performs implementation and deploying of new features, as well as correcting known issues.

Scientific advisory

Gabriela Moura

Main biology advisor and guide

Offers biology and genomics expertise and guidance. Suggests and oversees the development of new features based on state-of-the-art literature and in-house research. Tests and validates the application in real environments and experiments.

Manuel Santos

Biology advisor and guide

Participates in the biology discussion and provides insight into new techniques and approaches in optimizing genes.

Gene Loading

EuGene efficiently opens and interprets FASTA and GenBank files. It can quickly read a genome, select a genetic code to decode the codons, filter unwanted sequences, and perform statistical calculations that are later used in optimisation. Opened genomes can be searched for genes by annotation.
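As an illustration of the loading and annotation search described above, here is a minimal sketch in Python. It is not EuGene's own code (EuGene is written in Java); the function names parse_fasta and search_by_annotation are hypothetical, and only the FASTA format is covered.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {h: "".join(parts) for h, parts in records.items()}


def search_by_annotation(records, query):
    """Return the headers whose annotation text contains the query (case-insensitive)."""
    q = query.lower()
    return [h for h in records if q in h.lower()]


example = """>gene1 elongation factor Tu
ATGAAA
TAA
>gene2 hypothetical protein
ATGTAG
"""
genome = parse_fasta(example)
hits = search_by_annotation(genome, "elongation")
```

Sequence lines that are wrapped over several lines are joined, which is why gene1 above ends up as a single nine-nucleotide string.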

Also, if a certain gene is not present in the loaded genome, or you are working with a new gene, there is the option of manually adding a gene to the genome.

Gene Optimisation

The main functionality of EuGene is the optimisation of gene codon sequences according to specific aspects. A set of seven redesign approaches is available to customise genes.

All modifications to a gene are made without changing the amino acid sequence, i.e. codons are only replaced by synonymous ones. Moreover, changes are controlled by the genetic code and codon usage/context tables of a selectable host species, in order to ensure that the amino acid sequence is maintained in the expression system.
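The idea of replacing codons only by synonymous ones can be sketched in a few lines of Python. This is an illustrative sketch, not EuGene's algorithm: it assumes the standard genetic code, a host codon usage table given as simple counts, and a hypothetical function name maximise_codon_usage that greedily picks the most used synonymous codon at every position.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split a coding sequence into codons, ignoring any trailing partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def translate(seq):
    """Translate a coding sequence into amino acids ('*' marks a stop codon)."""
    return "".join(GENETIC_CODE[c] for c in codons(seq))


def maximise_codon_usage(seq, host_usage):
    """Replace every codon by the synonymous codon most used in the host."""
    redesigned = []
    for codon in codons(seq):
        amino_acid = GENETIC_CODE[codon]
        synonyms = [c for c, aa in GENETIC_CODE.items() if aa == amino_acid]
        redesigned.append(max(synonyms, key=lambda c: host_usage.get(c, 0)))
    return "".join(redesigned)


# Hypothetical host codon counts, for illustration only.
host_usage = {"ATG": 1, "AAA": 2, "AAG": 1, "CTG": 3, "CTT": 1, "TAA": 1}
gene = "ATGAAGTTATTGTAG"
redesigned = maximise_codon_usage(gene, host_usage)
assert translate(redesigned) == translate(gene)  # protein is unchanged
```

Because each replacement is drawn from the set of codons that encode the same amino acid, the final assertion holds for any input gene and any usage table.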

The default algorithms that come with EuGene are the optimisation by codon usage, codon context, GC content, harmonisation of codon usage, control of hidden stop codons, removal of repetitions and removal of deleterious sequences (anti-SD, anti-Kozak, etc.). However, the application is expandable to support new plugins to redesign genes.

To redesign genes, EuGene uses several algorithms, such as a Genetic Algorithm and Simulated Annealing, combined with a Pareto archive (to keep track of the best genes). This allows two redesign modes: a faster one and a deeper one.
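To make the notion of a Pareto archive concrete, the following Python sketch keeps only the candidate genes that are not dominated by any other candidate. It is a generic textbook formulation under the assumption that every objective score is to be maximised; the names dominates and update_archive are hypothetical and do not come from EuGene.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def update_archive(archive, candidate):
    """Return a new archive after offering candidate = (scores, gene).

    The candidate is rejected if an archived entry dominates it or has the
    same scores; otherwise it is added and every entry it dominates is dropped.
    """
    scores = candidate[0]
    if any(dominates(s, scores) or s == scores for s, _ in archive):
        return archive
    kept = [(s, g) for s, g in archive if not dominates(scores, s)]
    return kept + [candidate]


# Scores are (codon usage score, hidden stop score); values are made up.
archive = []
for cand in [((0.5, 0.5), "a"), ((0.6, 0.4), "b"), ((0.7, 0.6), "c"), ((0.9, 0.1), "d")]:
    archive = update_archive(archive, cand)
```

A search procedure such as a genetic algorithm or simulated annealing would call update_archive for each new candidate, so the archive always holds the current set of best trade-offs.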

Auto-discovery features

There is a large amount of information about genes and genomes that can be considered for synthetic design. For instance, the protein secondary structure, tertiary structure, and level of conservation among orthologs, are often analysed to infer important regions of the gene. Other forms of information, such as the identities of genes and genomes, or sets of highly expressed genes for the calculation of CAI values are often readily available or can be calculated based on available information.

These issues are approached by EuGene using an auto discovery module that analyses genes and genomes and fetches or calculates all necessary information using online databases and offline tools.

Fetched information is then used to perform gene redesigns, or shown to users to help them analyse the gene in question. When a user loads a genome and selects a gene, the typical workflow of the application is to contact NCBI with the gene nucleotide sequence and run a BLAST search to identify the gene and its genome. After that, KEGG and PDB are contacted to download orthologs, highly expressed genes and a 3D structure of the resulting protein.

Note! EuGene does not have a single way of discovering information, but rather several. If an approach fails, other strategies are used.

Intuitive user interface

We try our best to make the user experience as smooth as possible. With that in mind, we made EuGene so that common repetitive tasks are easy and straightforward. Just by opening a genome and uploading a gene to the workspace, a lot of information, such as the protein 3D structure inside the well-known molecule viewer JMol, is automatically displayed.

Colour codes

EuGene only deals with mRNA coding sequences, and every codon is colour-coded according to a scheme selected by the user. The default scheme uses codon-usage information to highlight codons from red (low usage) to green (high usage), quickly indicating what kind of codons (usage-wise) the gene is using. The scheme can easily be changed to show several other aspects of the gene instead, such as out-of-frame stop codons, nucleotide and codon repetitions, etc.
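A red-to-green scale of this kind can be sketched as a linear blend between the two colours. The Python below is only an assumption about how such a mapping might look; the function name usage_colour and the choice of a linear RGB blend are hypothetical and are not taken from EuGene.

```python
def usage_colour(relative_usage):
    """Map a relative usage in [0, 1] to a hex colour from red (0) to green (1)."""
    f = min(1.0, max(0.0, relative_usage))  # clamp out-of-range values
    red = round(255 * (1 - f))
    green = round(255 * f)
    return f"#{red:02x}{green:02x}00"


low = usage_colour(0.0)   # pure red for the least used codons
high = usage_colour(1.0)  # pure green for the most used codons
```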


The EuGene analysis tool shows different colour schemes at once, highlighting codon usage, codon context, GC content, hidden stop codons and repeated nucleotides.

Protein interaction

One of the most useful and unique features in EuGene is the ability to instantly see where a codon (its amino acid) is placed inside the three-dimensional conformation of the protein. Whenever you place your mouse over a codon, the corresponding amino-acid in JMol is highlighted. It is also possible to select a sub-sequence of codons to highlight a zone of the protein.

The secondary structure of the protein is also shown below the codon and polypeptide sequence. This structure is automatically calculated using the PsiPred prediction software.

Compare orthologs

EuGene automatically downloads orthologs for your genes. When orthologs are being shown, the MUSCLE aligner is used to quickly align them, and a colour code indicates the level of conservation among species for each codon.
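One simple way to quantify conservation per codon, given sequences that have already been aligned, is the fraction of sequences that share the most common codon in each column. The Python sketch below assumes that measure; EuGene's actual conservation score is not documented here, and the function name codon_conservation is hypothetical.

```python
from collections import Counter


def codon_conservation(aligned):
    """Return, for each codon column, the fraction of sequences sharing the most common codon.

    `aligned` is a list of equal-length aligned coding sequences; gaps ('---')
    are treated as just another symbol.
    """
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if length % 3 or any(len(s) != length for s in aligned):
        raise ValueError("sequences must have equal length, a multiple of 3")
    scores = []
    for i in range(0, length, 3):
        column = [s[i:i + 3] for s in aligned]
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / len(aligned))
    return scores


scores = codon_conservation(["ATGAAATAA", "ATGAAGTAA", "ATG---TAA"])
```

In this example the first and last columns are fully conserved, while the middle column has three different entries and therefore the lowest possible score for three sequences.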

General info

General information regarding the selected gene is shown in the information panel. The majority of the information is fetched (using the gene discovery features) or calculated.

For instance, genome and gene names are obtained after a BLAST operation at NCBI, or any other strategy for recognizing the gene. The GC content, RSCU and CPB are calculated immediately upon opening the genome. However, the CAI and Nc are only calculated after retrieving necessary information from online databases (such as the set of housekeeping genes).
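Two of the measures mentioned above, GC content and RSCU (relative synonymous codon usage), can be computed directly from the sequences. The Python sketch below uses the usual textbook definitions and assumes the standard genetic code; the function names are hypothetical, and CPB, CAI and Nc are deliberately left out.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split a coding sequence into codons, ignoring any trailing partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def gc_content(seq):
    """Fraction of G and C nucleotides in the sequence."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def rscu(seqs):
    """RSCU per codon: observed count divided by the count expected
    if all synonymous codons of that amino acid were used equally."""
    counts = Counter(c for s in seqs for c in codons(s.upper()))
    values = {}
    for amino_acid in set(GENETIC_CODE.values()):
        synonyms = [c for c, aa in GENETIC_CODE.items() if aa == amino_acid]
        total = sum(counts[c] for c in synonyms)
        if total == 0:
            continue  # amino acid never observed
        for c in synonyms:
            values[c] = counts[c] * len(synonyms) / total
    return values


table = rscu(["ATG" + "AAA" + "AAA" + "AAG" + "TAA"])
```

An RSCU above 1 means the codon is used more often than its synonyms on average; here AAA is used twice and AAG once, so AAA scores above 1 and AAG below 1.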

This page is under development.

When ready, it will have information regarding this particular optimisation approach and its scientific background.

It will be available soon.

Other and new optimisation approaches

More design and optimisation approaches will be developed constantly as research advances and our understanding of the protein biosynthesis process grows. We work at the state-of-the-art level in the comprehension of genes, and therefore as soon as new discoveries are made, we develop new design strategies for EuGene and make them available to the end user.

Also, if you are interested in developing a new design strategy for EuGene, contact us and we will tell you how.

Tutorial

In this quick tutorial we teach you how to start using EuGene. The tutorial is divided into seven steps that will allow you to complete a simple task: open a Plasmodium falciparum (malaria parasite) gene and optimize some aspects for its expression in Escherichia coli.

Note! If you want a quick start by testing by yourself, try uploading this genome to EuGene and then using the tufA1 gene, as it has many online resources that let you see several of EuGene's features.


Step 1 - The main window

When you open EuGene for the first time you find a layout that looks like the image to the right.

This is the main window of EuGene, and it is divided vertically into three zones. The zone on the left has two panels: the top one, titled Gene Optimisation, is where you control the future redesign of your genes (we will come to that later), and the bottom one, Progress Panel, shows running and terminated tasks.

The centre zone, which is initially empty, is where the genes you want to work with will show up. This is your workspace. At the bottom you can see a separator saying "Project 1"; that is the default project we already created for you.

The right zone also has two panels: the top one, titled Protein Viewer, is where the 3D protein conformation of the selected gene is shown using JMol; the bottom one, Gene Information Panel, displays additional information regarding the selected gene.

Note! Besides the three zones, you can find a menu bar at the top of the main window where many functions can be accessed. For instance, in View → Hide Information Zone you can make the working space larger by hiding the Protein Viewer and Gene Information panels.

Step 2 - Loading genomes


Now, go to the menu Gene Pool → Open Gene Pool or, alternatively, File → Open Genome. This pops up a new window (as shown on the right) called Gene Pool, where you load genomes (and consequently genes) into the application.

You can follow the instructions at the centre of the window, but we will guide you through anyway. To open a genome, click the Load genome button, which will open another, smaller window (shown below). In this window you have a drop-down menu where a genetic code can be selected. This genetic code will be used to interpret the genome file that will be opened.

The standard genetic code is selected by default. Below the drop-down menu there are three check-boxes corresponding to the filters that can be used to ignore unwanted genes from the genome that will be opened. For instance, if the No start codon filter is selected, any genes that do not have a start codon (according to the selected genetic code) will be ignored. Finally, click the Open genome button and select a genome file to open (here is an example genome file).
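The effect of a filter such as No start codon can be sketched as a predicate applied to every gene before it enters the gene pool. The Python below is an illustration only: it assumes ATG as the sole start codon unless told otherwise, and the names has_start_codon and apply_filters are hypothetical. The other two filters are not described in this text, so they are not modelled.

```python
def has_start_codon(seq, start_codons=("ATG",)):
    """True if the sequence begins with one of the accepted start codons."""
    return seq[:3].upper() in start_codons


def apply_filters(genes, filters):
    """Keep only the genes (name -> sequence) that pass every filter."""
    return {name: seq for name, seq in genes.items() if all(f(seq) for f in filters)}


genes = {"geneA": "ATGAAATAA", "geneB": "CCCAAATAA"}
kept = apply_filters(genes, [has_start_codon])  # geneB is ignored
```

Since the set of valid start codons depends on the selected genetic code, start_codons is a parameter rather than a fixed constant.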

A progress bar will show up, indicating the progress of the loading process. When the genome is loaded, the application will ask you for a genome name (it is preferable that you write the real species name, as it can help the application in finding information).

Note! If your genome is split into several chromosome files, just select all the files in the file-opening dialog. The files will then be opened together as a single genome.

Step 3 - Uploading a gene to the workspace

Your opened genome will be shown in the gene pool as a new tab, and all non-filtered-out genes will be displayed in a table. Each row has a name and a size, corresponding to the annotation of the gene and its size in codons.

As an example we opened the Plasmodium falciparum genome, which contains the genes we want to work with. Also, we repeated all the steps to open the Escherichia coli genome, since we want to express our genes in that host (see the image on the right). You can open as many genomes as you wish. Note that every time you open a genome, several statistical calculations are immediately made, and their progress is shown in the Progress panel.

Now it's time to select a gene (or several) to work with. There is a search zone on the top that allows looking for a gene in the genome with a specific text in its annotation. As you type, the gene list of the active genome will be instantly filtered. After finding the gene you want, select it by clicking on its entry in the table and then pressing the Upload selected gene to workspace button, or, alternatively, double-click the entry. For the chosen example, we searched for a specific gene from Plasmodium falciparum, and uploaded it to the workspace.

A new panel will show up in the centre zone of the main window, where you can see the codon sequence and the decoded amino-acid sequence. This is a gene panel (see image below). Also note that the application started some tasks automatically, such as downloading the protein 3D structure (which will show automatically, if available), finding gene and genome names, and obtaining orthologs for your gene. The codons are coloured (by default) according to their codon usage, varying from red (lower usage) to green (higher usage).

Note! You can place your own genes into an already loaded genome. Just click the Add gene to this genome link in the gene pool and follow the instructions.

Step 4 - Optimising a gene for heterologous expression

You can upload as many genes as you want to the workspace. To select a gene in the workspace, simply click on it (it will be highlighted to distinguish it from the others). With your gene selected, you can see information about it in the Gene Information Panel, such as its GC content or CAI. Note that to calculate CAI, the application first automatically downloads house-keeping genes for that genome. That might take a while, but you can see the progress in the Progress Panel.

At the top of the Gene Optimisation panel (image on the right) there is a drop-down menu where you can select the target expression host for your gene. This menu holds all the genomes that you previously loaded into the gene pool. For the sake of the example, the Escherichia coli entry was selected.

Now you decide how you want the gene to be optimised. For the chosen example, we want to maximize the codon usage AND remove any out-of-frame stop codons from our gene (as a hypothetical experiment). Thus, we select the check-boxes for Codon Usage and Hidden Stop Codons. Clicking on the left arrow of each of those optimisation approaches reveals their options. There, we selected Maximize codon usage and RSCU for the codon usage and Minimize hidden stop-codons for the hidden stop codons.

After having selected both expression host and optimisation approaches, press the Redesign Gene button to start the optimisation (and optionally select the Fast checkbox for faster results). Depending on the size of the gene and the number of selected optimisation approaches, the optimisation task can take up to several minutes.

Step 5 - Analysing optimisation results

For the sake of the tutorial, we chose the Fast option. As soon as the optimisation finishes, a new gene panel is created with a yellow title-bar (see image below). The title also says that the species of the gene is Escherichia coli, because we selected it as our target host species. If you select the newly created panel and look at the gene information panel, you will notice a new section called Optimisation report where the results of the optimisation are stated.

In the example (image below on the right), the report says that codon usage was improved by 44.1% and hidden stop-codons by 54.3%, reaching final individual scores (compared to the best and worst possible individual scores) of 92.5% and 100% respectively. Both are at or very close to the individual maxima; for hidden stop-codons, the best possible outcome was achieved.
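One plausible reading of a score "compared to the best and worst possible individual scores" is a min-max normalisation, where the worst possible outcome maps to 0% and the best to 100%. The Python sketch below assumes that reading; the formula and the name relative_score are illustrative assumptions, not EuGene's documented scoring.

```python
def relative_score(value, worst, best):
    """Place `value` on a 0..1 scale between the worst and best possible scores."""
    if best == worst:
        raise ValueError("best and worst scores must differ")
    return (value - worst) / (best - worst)


halfway = relative_score(5, worst=0, best=10)  # 0.5, i.e. 50%
```

Under this reading, a final score of 100% means the redesigned gene reached the best value that the approach could achieve for that gene.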

To further analyse a gene, EuGene allows the user to perform a Gene diagnosis using the available optimisation approaches. For our experiment, we want to analyse the GC content and hidden stop-codons of the optimised gene.

Thus, we selected the check-boxes of those optimisation approaches in the Gene Optimisation panel (and unchecked all others) and then pressed the Gene Diagnosis button. A new panel is created in the workspace (above the gene panels) with the resulting analysis of the two approaches. From the analysis one can verify that there are only 3 out-of-frame stop codons, whereas there are 73 in the original gene. We can also see a significant increase in GC content as a consequence of optimising for expression in E. coli.

Note! The analysis is made using the colour schemes from each optimisation method. Though the default colour scheme for gene panels is using codon usage, you can change it by right clicking on top of the codon sequence and selecting Select Colour Scheme → Some optimisation approach.

Step 6 - Final features

Though the main goal of EuGene is redesigning genes, other features include calculating a prediction of the secondary structure of the protein. For that, select your gene panel, and go to Edit → Show/Hide protein secondary structure. The predicted structure (made of α-helices, β-strands and coils) is shown in the gene panel after being calculated.

Another feature is the ability to fetch and align orthologs. As soon as you upload your gene to the workspace, EuGene contacts KEGG and downloads orthologs (if available). When this process terminates, you can select your gene panel and go to Edit → Show/Hide orthologs. EuGene then uses the MUSCLE aligner to align the fetched orthologs and show them in your gene panel, colour-coded according to conservation.

Moreover, if EuGene found a 3D protein representation for your gene (which is displayed in the protein viewer panel) you can easily map your codons and amino acids from the sequence to the 3D representation. To do so, just place your mouse cursor over an amino acid in the gene panel and see it highlighted in the 3D protein. You can also select a range of codons (just like selecting text) and they will be highlighted as well.

The codon-selecting behaviour can also be used to optimise only specific zones of a gene. Thus, before starting an optimisation, select the range of codons you wish to optimise (or avoid being optimised).

Step 7 - Saving the work

There are two ways of saving your work. The first and simplest one is to right-click on top of your gene panel and select Copy codon sequence or, if you need the amino acid sequence, Copy amino acid sequence, which will copy the corresponding sequence text into the clipboard.

The second way is to save your project. You can do this by selecting the menu Project → Save Project, which will open a dialog where you can select where to save your project. This option will save all genes that are open in the workspace, and also remember which genome files you had open in your Gene Pool.

To load a saved project, go to Project → Load Project and select the project file you previously saved. This will re-create all the gene panels you had in the workspace, and will also load all genomes that you previously loaded into your gene pool.

Note! EuGene is constantly being developed and corrected. You will receive these developments automatically each time you start EuGene. If you find a problem or unusual behaviour, please send us a small description, and it is very likely that the next time you open EuGene the problem is already solved. Suggestions are also very welcome!

Download

EuGene is made in Java. That means you will only have to download EuGene once, and every update with new features and corrections will be automatically made every time you reopen the application.

Although you can use it offline, EuGene uses a lot of internet resources such as online biological databases, and therefore it is preferable that you are connected to the internet while using it.

Click here to download Eugene

Note: EuGene is available for Windows (XP, Vista, 7, 8.1), Mac and Linux

Development

EuGene is under constant development (improved every day!).

Future features

This is (by far!) not the complete list of features that we have planned, but only a small sub-set that we would like to share.

  • Gene Diagnosis: A much more complex gene diagnosis, to identify issues with the gene and suggest optimisations.

  • Gene Comparison: A gene comparison feature to perceive the evolution of an optimised gene when compared with its original version.

  • Manual redesign: Allow manually changing the gene, adding/removing/altering codons.

  • Better faster optimisation: An even better optimisation engine, to obtain Pareto front results at a fast pace.

  • More optimisation options: More gene redesign strategies, such as mRNA secondary structure optimisation.

  • Show more information about genes: Fetch and show more information about genes and genomes, such as related literature.

  • Availability to other OS: Make EuGene available to other operating systems: MacOS and Linux.

Known issues

These are small issues that we are aware of and are currently working to correct. As corrections are made and deployed every day, it is highly likely that this list is outdated.

  • Some optimisation approaches change codons unnecessarily, that is, even if they don't play a role in the optimisation result.

  • Genomes in the gene pool cannot be closed. This is not a bug, but the lack of a feature.

  • Orthologs are obtained from KEGG using a KEGG code built from the first letters of the organism name. This is not the rule for a few organisms in KEGG.
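The last issue above can be illustrated with a short Python sketch. It assumes the common pattern of taking the first letter of the genus and the first two letters of the species; the exact rule EuGene applies is not stated here, and the function name guess_kegg_code is hypothetical. As the issue notes, such a guess does not hold for every organism in KEGG.

```python
def guess_kegg_code(organism):
    """Guess a three-letter KEGG organism code from a binomial species name.

    Assumption: first letter of the genus plus first two letters of the species.
    """
    parts = organism.strip().split()
    if len(parts) < 2:
        raise ValueError("expected a name of the form 'Genus species'")
    genus, species = parts[0], parts[1]
    return (genus[0] + species[:2]).lower()


code = guess_kegg_code("Escherichia coli")  # "eco"
```

A more robust approach would look the organism up in KEGG's own organism list instead of deriving the code from the name.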