José Luis Oliveira
Universidade de Aveiro, DETI / IEETA
3810-193 Aveiro, Portugal
(+351) 234 370 500
EAGLE is an alignment-free method and associated program to compute relative absent words (RAW) in genomic sequences using a reference sequence. Currently, EAGLE runs in a Linux command-line environment, building an image (in SVG) with patterns that mark the absent-word regions and writing the associated positions to a file. EAGLE includes scripts to run on the current outbreak and the other existing Ebola virus genomes (using the human genome as a reference), covering the download, filtering and processing of the entire data set.
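As a rough illustration of the idea, the Python sketch below lists the words of a fixed length k that occur in a target sequence but never in a reference sequence, together with their positions in the target. It assumes that this is what "relative absent" means here. The function names and the toy sequences are hypothetical, and the sketch ignores word minimality, reverse complements and the memory-efficient structures a genome-scale tool would need. It is not EAGLE's implementation.

```python
def kmers(sequence, k):
    """Return the set of all words of length k found in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def relative_absent_words(target, reference, k):
    """List (position, word) pairs for k-mers of the target that are absent from the reference."""
    reference_words = kmers(reference, k)
    hits = []
    for i in range(len(target) - k + 1):
        word = target[i:i + k]
        if word not in reference_words:
            hits.append((i, word))
    return hits


if __name__ == "__main__":
    # Toy sequences, for illustration only.
    for position, word in relative_absent_words("ACGTAC", "ACGAAC", 3):
        print(position, word)
```

The positions returned here play the role of the position file mentioned above; a real run would stream the reference rather than hold every word in a Python set.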
Raquel M. Silva, Diogo Pratas, Luísa Castro, Armando J. Pinho & Paulo J. S. G. Ferreira. Bioinformatics (2015): btv189.
Luis Bastião Silva, “A federated architecture for biomedical data integration”
Universidade de Aveiro, DETI/IEETA
Nevertheless, the method is not limited to primate data. The following image shows the information map between Meleagris gallopavo and Gallus gallus chromosomes 1 using a threshold of 0.95.
Carlos Ferreira, “Handling Data Access Latency in Distributed Medical Imaging Environments”
Universidade de Aveiro, DETI/IEETA
Date: 2015.04.10, 10.00 AM
Anfiteatro, Reitoria, Universidade de Aveiro
Paulo Gaspar, “Computational methods for gene characterization and genome knowledge extraction”
Universidade de Aveiro, DETI/IEETA
MENT is a set of tools for lossless compression of microarray images; however, it can also be used on other kinds of images, such as medical and RNAi images, among others. The tools are divided into two categories, defined by the decomposition approach used:
In what follows, we will describe the set of tools available in MENT:
If you use some tool from MENT, please cite the following publications:
SACO was designed to handle the DNA bases and gap symbols that can be found in MAF files. Our method is based on a mixture of finite-context models. In contrast to a recent approach, it addresses both the DNA bases and gap symbols at once, better exploring the existing correlations. For comparison with previous methods, our algorithm was tested on the multiz28way dataset. On average, it attained 0.94 bits per symbol, approximately 7% better than the previous best, for a similar computational complexity. We also tested the model on the most recent dataset, multiz46way. On this dataset, which contains alignments of 46 different species, our compression model achieved an average of 0.72 bits per MSA block symbol.
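To make the phrase "mixture of finite-context models" more concrete, here is a minimal Python sketch, not SACO's actual models. It assumes a five-symbol alphabet (the four bases plus the gap symbol), two models of hypothetical orders 1 and 3 with additive smoothing, and a simple performance-based reweighting with a forgetting factor gamma. All of these choices are illustrative assumptions. It reports the average code length in bits per symbol that an ideal arithmetic coder would reach with the mixed probabilities.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT-"  # DNA bases and the gap symbol handled together


class FiniteContextModel:
    """Order-k model: symbol counts conditioned on the k previous symbols."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha  # additive smoothing, so no symbol gets probability 0
        self.counts = defaultdict(lambda: [0] * len(ALPHABET))

    def _context(self, history):
        return history[-self.order:] if self.order else ""

    def probabilities(self, history):
        counts = self.counts[self._context(history)]
        total = sum(counts) + self.alpha * len(ALPHABET)
        return [(n + self.alpha) / total for n in counts]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1


def average_bits_per_symbol(sequence, orders=(1, 3), gamma=0.9):
    """Mix the models' predictions and return the mean code length in bits."""
    models = [FiniteContextModel(k) for k in orders]
    weights = [1.0 / len(models)] * len(models)
    longest = max(orders)
    total_bits = 0.0
    for pos, symbol in enumerate(sequence):
        history = sequence[max(0, pos - longest):pos]
        index = ALPHABET.index(symbol)
        predictions = [m.probabilities(history) for m in models]
        mixed = sum(w * p[index] for w, p in zip(weights, predictions))
        total_bits += -math.log2(mixed)
        # Favour the models that predicted this symbol well.
        weights = [(w ** gamma) * p[index] for w, p in zip(weights, predictions)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        for m in models:
            m.update(history, symbol)
    return total_bits / len(sequence)


if __name__ == "__main__":
    print(average_bits_per_symbol("ACGT-" * 40))
```

On a highly repetitive input the average falls below log2(5), about 2.32 bits, which is the cost of coding five equiprobable symbols without any model.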
If you use this software, please cite the following publications:
MAFCO is a lossless compression tool specifically designed to compress MAF (Multiple Alignment Format) files. Compared to gzip, the proposed tool attains a compression gain from ≈ 34% to ≈ 57%, depending on the data set. When compared to a recent dedicated method, which is not compatible with some data sets, the compression gain of MAFCO is about 9%. MAFCO was designed and implemented at IEETA, a research unit of the University of Aveiro, and is available for non-commercial use.
Luís M. O. Matos, António J. R. Neves, Diogo Pratas and Armando J. Pinho. “MAFCO: a compression tool for MAF files”. PLoS ONE 10(3): e0116082.
Systems that aggregate health and entertainment goals are proliferating, but little is known about how to design and evaluate these systems and how to manage the different (if not opposite) needs of these two main areas. This workshop will promote the discussion of issues surrounding these areas, enabling a better understanding of the how’s and why’s of designing systems for health and entertainment, as well as the identification of new avenues of research in the field.
Therefore we invite designers, researchers and practitioners to participate in an exciting full-day workshop where they are invited to share their personal views and research on the intersection of technology, health and entertainment.
More information at http://designingsystemsforhealthandentertainment.wordpress.com/.
The 6th International Symposium on Semantic Mining in Biomedicine (SMBM) will be held on 6th-7th October 2014 at the University of Aveiro, Portugal.
SMBM aims to bring together researchers from text and data mining in biomedicine, medical, bio- and chemoinformatics, and researchers from biomedical ontology design and engineering. SMBM 2014 is the follow-up event to SMBM 2012 (University of Zürich, Switzerland), SMBM 2010 (EBI, U.K.), SMBM 2008 (University of Turku, Finland), SMBM 2006 (University of Jena, Germany), and SMBM 2005 (EBI, U.K.).
More information at http://www.smbm.org.
Luis Ribeiro, “Platform for on-demand exchange of medical imaging communities”
Universidade de Aveiro, DETI/IEETA
David Campos, “Term expansion methodologies in biomedical information retrieval”
Universidade de Aveiro, DETI/IEETA
The FCT Investigator Programme aims to create a talent base of scientific leaders, by providing 5-year funding for the most talented and promising researchers, across all scientific areas and nationalities.
For the 2013 call, Sérgio Matos, research assistant at IEETA, was awarded an FCT Investigator grant for the 2014-2018 period.
DNAatGlance is a program for the detection of large-scale genomic regularities by visual inspection. Several discovery strategies are possible, including the standalone analysis of single sequences, the comparative analysis of sequences from individuals from the same species, and the comparative analysis of sequences from different organisms. The software was designed and implemented at IEETA, a research unit of the University of Aveiro, and is available for non-commercial use.
Armando J. Pinho, Sara P. Garcia, Diogo Pratas, Paulo J. S. G. Ferreira (2013) DNA Sequences at a Glance. PLoS ONE 8(11): e79922.
For convenience, we provide a sequence (here in gzip)(here in zip) and the corresponding information profile in WIG format (here in gzip) (here in zip) that can be uploaded to the UCSC Genome Browser as a custom track.
MFCompress is a compression tool for FASTA and multi-FASTA files. In comparison to gzip and applied to multi-FASTA files, MFCompress can provide additional average compression gains of almost 50%, i.e., it potentially doubles the available storage, although at the cost of some more computation time. On highly redundant data sets, and in comparison with gzip, 8-fold size reductions have been obtained. MFCompress was designed and implemented at IEETA, a research unit of the University of Aveiro, and is available for non-commercial use. For other uses, please send an email to email@example.com.
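The link between a compression gain and the amount of data that fits in the same storage can be checked with a few lines of Python. The sketch assumes that "gain" means the relative size reduction with respect to the gzip output, which is a reading of the text rather than a definition given here. The function names are hypothetical.

```python
def compression_gain(baseline_size, new_size):
    """Relative size reduction of new_size with respect to baseline_size."""
    return 1.0 - new_size / baseline_size


def capacity_factor(gain):
    """How many times more data fits in the same space for a given gain."""
    return 1.0 / (1.0 - gain)


if __name__ == "__main__":
    print(capacity_factor(0.5))        # a 50% gain doubles the usable storage
    print(compression_gain(800, 100))  # an 8-fold reduction is a gain of 0.875
```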
Armando J. Pinho, and Diogo Pratas. “MFCompress: a compression tool for FASTA and multi-FASTA data.” Bioinformatics 30.1 (2014): 117-118.
Egas is a web-based platform for biomedical text mining and collaborative curation. The web tool allows users to annotate texts with concept occurrences as well as with relations between concepts. Annotations can be performed manually or based on the results of automated concept identification and relation extraction tools. These automatic annotations may have been previously added to the documents, using one of the accepted input formats, or may be added during the annotation process, by calling a document annotation service. Users can inspect, correct or remove automatic text mining results, manually add new annotations, and export the results to standard formats.
Funding entity: QREN MaisCentro
Period: Feb.2013 – Dec.2014
The project’s ambition is the creation of a new set of solutions based on novel ICT technologies, developing a concept that encompasses the synergistic usage of cloud computing, with large database access and information retrieval, associated with advanced methods for reasoning and data mining (and with the basic scalable algorithms to support the dimensions of the data sets targeted).
Funding entity: QREN MaisCentro
Period: Feb.2013 – Jun.2015
Neurodegenerative disorders are a major health concern worldwide, Portugal being no exception. With this project the University of Aveiro proposes to extend existing research in the field of neurodegenerative diseases through the creation of a consortium of 5 research units from UA (CBC, QOPNA, I3N, IEETA, CICECO). The project’s main goal is to offer novel therapeutic strategies to tackle the complex array of existing neuropathologies. By building a multidisciplinary research team that combines UA experts in molecular neuropathologies, proteomics, metabolomics, bioinformatics, neuronal networks, organic synthesis and drug design, we will be able to attack the problem on many fronts. Upon successful completion of this project, new therapeutic approaches will have been developed which will contribute to improving the quality of life of neurodegenerative patients, with a high societal impact considering the 10 million new patients reported every year.
The prize was awarded at BioLINK SIG 2013 for the work “Neji: a tool for heterogeneous biomedical concept identification”.
BioLINK SIG 2013: Roles for text mining in biomedical knowledge discovery and translational medicine
The Annual Meeting of the ISMB BioLINK Special Interest Group
In Association with ISMB/ECCB 2013, Berlin, Germany
July 20, 2013
A 6-hour iOS Development Seminar will be held by Rui Pedro Lopes, Professor at the Polytechnic Institute of Bragança, on 29th July 2013, at the Department of Electronics, Telecommunications and Informatics (DETI), Aveiro.
This Seminar will cover the following main topics: Objective-C, Storyboards, Core Data, Master-Detail User Interface
Universidade de Aveiro, DETI, Room 102, 10h
Variobox is a desktop tool for the annotation, analysis and comparison of human genes. Variant annotation data are obtained from WAVe, protein metadata annotations are gathered from PDB and UniProt, and sequence metadata is obtained from Locus Reference Genomic (LRG) and RefSeq databases. By using an advanced sequence visualization interface, Variobox provides an agile navigation through the various genetic regions. Researched genes are compared to the sequences retrieved from LRG and RefSeq, automatically finding and annotating new potential mutations. These features and data, ranging from patient sequences to HGVS-valid variant description up to pathogenicity evaluation, are combined in an intuitive interface to explore genes and mutations.
To cite this tool use the following publication:
Variobox: Automatic Detection and Annotation of Human Genetic Variants. Paulo Gaspar, Pedro Lopes, Jorge Oliveira, Rosário Santos, Raymond Dalgleish, José Luís Oliveira. Human Mutation, 2014
VarioBox is available for all the main operating systems (Windows [XP, 7, 8]; Linux; MacOS) that support Java. The current version of the software is 1.4.4. Click the link below to download:
To run, first unpack all the files to any folder. Then, if you’re on Windows, double click the Variobox file inside the folder. On Mac or Linux, start a terminal, change the directory to the created folder, and run java -jar variobox.jar
This is the initial VarioBox workspace that shows up when you open the application. At the bottom of the workspace you can find a tab, “Home”, created automatically. There will be as many tabs as searches performed, each one identified by the searched HGNC code. At the centre you can see the logo and a panel where searches for reference genes can be performed, using a valid HGNC symbol. To work with Variobox, a reference gene is always the starting point. After obtaining the reference, a sequence can be loaded into the application, aligned with the reference, and analysed.
By default there are two genes below the search box: Collagen, type I, alpha 1 (COL1A1), and Myotubularin 1 (MTM1). Click on COL1A1 or type it in the search box and hit search. A progress bar will show the progress of the loading process. A new tab (with the name of the searched HGNC code), like the one below, will show up once the reference gene is automatically retrieved from the web servers:
The right zone is formed by two distinct panels:
On the top of the window there is a large genomic viewer with a movable and resizable window that allows specifying a region to be explored in the centre zone. This viewer distinguishes exons (blue) and introns (purple), and allows quickly jumping through the gene. The centre zone is populated with gene data and information, in three distinct panels, described below:
In this panel you can see the codon sequence and the decoded polypeptide sequence, labelled Reference Sequence and Translated Sequence respectively, and also the Known Mutations for the gene, as retrieved from WAVe. A zoomed genomic viewer is also displayed to further facilitate the exploration of the gene.
Mutations are identified by different colours, and shown next to the corresponding nucleotides. Additional information about a mutation can be obtained by clicking on the mutation. The Information Panel (right side of the workspace) will display details regarding the selected mutation’s position, source, type, annotation, etc.
The Navigation Panel also permits filtering what mutation types are to be shown in the Gene Panel. For instance, if you only check Substitutions, all mutations besides SNPs will be hidden.
This panel shows quick information about the gene that you are analysing. The information currently supported is the following:
To load a gene sequence and align it with the reference gene, click the menu Genes → Load gene file. Alternatively, go to the menu File → Load gene file. You will be prompted with a new window to select the file you want to load. For the current version we support the file types:
After selecting the file (or files, if you choose the forward-reverse format), click Load selected file and VarioBox will read them. Once the file is correctly loaded, an alignment with the reference gene is automatically performed. This alignment will also display found mutations, as compared to the reference gene. The analysis of the loaded sequence is described in the next step.
After the files are loaded, the Gene Panel will be updated with the mutated sequence as well as the calculated mutations, as depicted in the following figure:
The loaded sequence will also be coloured according to its chromatogram confidence (if there is one), ranging from green (high confidence) to red (no confidence). This makes it easy to assess the validity of the calculated mutations. Also note that the mutations are automatically annotated using the standard notation, and each annotation is displayed when clicking on a mutation. To save the sequences, mutations, alignment and other information, the gene should be assigned to a patient. To do so, go to the menu Genes → Save to patient and select a patient from the list of patients that will be presented.
If you want to register a new patient in VarioBox, follow these steps: go to Patients → New patient and fill in the Patient Details panel (shown below) with the required information (note that only one field is mandatory). Then click Save patient and a new record will be created.
To load a saved project, go to Patients → Open patient and select the patient you previously saved. This will create a new tab with all the patient information: the patient’s personal information as well as the genes from that patient. Those genes can be opened by selecting them and clicking Open selected.
This action will open as many tabs as genes you have selected and will re-create all the gene panels you previously had in the workspace.
Closing tabs is as simple as going to Patients → Close patient or Genes → Close current gene project, depending on the type of tab you have open.
becas is a web application, API and widget for biomedical concept identification. It helps researchers, healthcare professionals and developers in the identification of over 1,200,000 biomedical concepts in text and PubMed abstracts.
becas provides annotations for isolated, nested and intersected entities. It identifies concepts from multiple semantic groups, providing preferred names and enriching them with references to public knowledge resources. You can choose the types of entities you want to identify and highlight or mute specific entities in real-time.
To facilitate annotation of PubMed abstracts, becas automatically fetches publications from NCBI servers and renders them with identified concepts highlighted.
You can access the becas web annotation tool here and learn to use it in its help page. Explore the Web API in the API docs and discover how easy it is to integrate the becas widget in the widget docs.
Diseasecard is a public web portal that integrates real-time information from distributed and heterogeneous medical and genomic databases, presenting it in a familiar visual paradigm.
Bioinformatics is playing a key role in molecular biology advances, not only by enabling new methods of research, but also by managing the huge amounts of relevant information and making it available worldwide.
State-of-the-art methods in bioinformatics include the use of public databases to publish scientific breakthroughs. These databases provide valuable knowledge for medical practice. But, given their specificity and heterogeneity, we cannot expect medical practitioners to include their use in routine investigations. To obtain a real benefit from them, the clinician needs integrated views over the vast number of knowledge sources, enabling seamless querying and navigation.
Main goals behind the conception of DiseaseCard:
Diseasecard can provide the answers to several questions that are relevant to the diagnosis, treatment and follow-up of genetic diseases, such as:
The EU-ADR Web Platform helps experts in the study of adverse drug reactions (ADRs) through the use of computational services and scientific workflows, provided by several European partners. The system assists in the earlier detection of adverse drug reactions, improving drug safety and contributing to public health benefit. You can access the EU-ADR Web Platform here
The overall objective of this project was the design, development and validation of a computerized system that exploits data from electronic healthcare records and biomedical databases for the early detection of adverse drug reactions. Visit the project page.
Talk from Luís Paquete, Anf. IEETA
The multiobjective formulation of the pairwise sequence alignment problem is introduced, where a vector score function takes into account the substitution score and indels or gaps separately. Two solution methods are introduced: a multiobjective dynamic programming that extends classical algorithms for this problem and an epsilon-constraint algorithm that solves a series of constrained sequence alignment problems. A state pruning technique based on the concept of bound sets is also presented. Finally, its application to phylogenetic tree construction is
Universidade de Aveiro, Anf. IEETA, 14h30
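The multiobjective dynamic programming idea in the abstract above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the speaker's algorithm: each cell of the classical alignment table stores a set of non-dominated (substitution score, number of indels) pairs instead of a single score, the substitution score is maximised while the indel count is minimised, every gap column counts as one indel, and the match and mismatch scores are hypothetical defaults. Bound-set pruning and the epsilon-constraint method are not shown.

```python
def pareto_filter(points):
    """Keep (score, indels) pairs not dominated by a pair with a
    higher-or-equal score and a lower-or-equal indel count."""
    front = []
    fewest_indels = None
    # Visit the highest scores first; a later point survives only if it
    # uses strictly fewer indels than every point seen so far.
    for score, indels in sorted(set(points), key=lambda p: (-p[0], p[1])):
        if fewest_indels is None or indels < fewest_indels:
            front.append((score, indels))
            fewest_indels = indels
    return front


def pareto_alignment_scores(s, t, match=1, mismatch=0):
    """Return the Pareto front of (substitution score, indels) over all
    global alignments of s and t."""
    n, m = len(s), len(t)
    table = [[[] for _ in range(m + 1)] for _ in range(n + 1)]
    table[0][0] = [(0, 0)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:  # align s[i-1] with t[j-1]
                sub = match if s[i - 1] == t[j - 1] else mismatch
                candidates += [(sc + sub, g) for sc, g in table[i - 1][j - 1]]
            if i > 0:  # gap in t
                candidates += [(sc, g + 1) for sc, g in table[i - 1][j]]
            if j > 0:  # gap in s
                candidates += [(sc, g + 1) for sc, g in table[i][j - 1]]
            table[i][j] = pareto_filter(candidates)
    return sorted(table[n][m])


if __name__ == "__main__":
    print(pareto_alignment_scores("AC", "CA"))  # [(0, 0), (1, 2)]
```

For "AC" against "CA" the front has two points: an ungapped alignment with no matches, and an alignment that buys one match at the price of two indels. Neither is better in both objectives, which is exactly the trade-off a single scalar score would hide.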
Funding entity: IMI-JU
In recent years, the development and use of Electronic Healthcare Records (EHRs) throughout Europe has grown exponentially resulting in large volumes of clinical data. At the same time, large collections of disease‐specific data are recorded – in local, regional and/or national settings. Researchers also follow specific cohorts over time, and focus on specific types of data such as imaging or genetic data. Other researchers are building biobanks that aim to combine clinical data with genetic data. As a result, individual patients can contribute to multiple, often separate, data sources.
This project combines the topic of generating a common patient health Information Framework (IF) with addressing the two Research Topics (RT’s) Obesity and its metabolic complications and Markers for the development of Alzheimer’s disease (AD) and other dementias.
Funding entity: FP7-HEALTH-2012-INNOVATION-1
Despite examples of excellent practice, rare disease (RD) research is still mainly fragmented by data and disease types. Individual efforts have little interoperability and almost no systematic connection between detailed clinical and genetic information, biomaterial availability or research/trial datasets. By developing robust mechanisms and standards for linking and exploiting these data, RD-Connect will develop a critical mass for harmonisation and provide a strong impetus for a global “trial-ready” infrastructure ready to support the IRDiRC goals for diagnostics and therapies for RD patients.
Neji is an innovative framework for biomedical concept recognition. It is open source and built around four key characteristics: modularity, scalability, speed, and usability. It integrates modules of various state-of-the-art methods for biomedical natural language processing (e.g., sentence splitting, tokenization, lemmatization, part-of-speech tagging, chunking and dependency parsing) and concept recognition (e.g., dictionaries and machine learning). The most popular input and output formats, such as Pubmed XML, IeXML, CoNLL and A1, are also supported. Additionally, the recognized concepts are stored in an innovative concept tree, supporting nested and intersected concepts with multiple identifiers. This structure provides enriched concept information and gives users the power to decide the best behavior for their specific goals, using the included methods for handling and processing the tree.
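The difference between nested and intersected concepts, and the way a tree can hold both, can be illustrated with a small Python sketch. It is not Neji's data structure or API: the class, the function names, the character offsets and the identifiers are all hypothetical. The sketch assumes that a concept is a character span with a list of identifiers, that a span fully contained in another becomes its child, and that partially overlapping spans stay siblings.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptNode:
    start: int                 # first character offset
    end: int                   # offset one past the last character
    ids: list                  # one concept may carry several identifiers
    children: list = field(default_factory=list)


def relation(a, b):
    """Classify two spans as isolated, nested or intersected."""
    if a.end <= b.start or b.end <= a.start:
        return "isolated"
    a_contains_b = a.start <= b.start and b.end <= a.end
    b_contains_a = b.start <= a.start and a.end <= b.end
    return "nested" if a_contains_b or b_contains_a else "intersected"


def build_tree(nodes):
    """Place each span under the innermost span that fully contains it."""
    roots = []
    # Longer spans come first among those starting at the same offset,
    # so a container is always inserted before what it contains.
    for node in sorted(nodes, key=lambda n: (n.start, -n.end)):
        siblings = roots
        while True:
            parent = next((s for s in siblings
                           if s.start <= node.start and node.end <= s.end), None)
            if parent is None:
                break
            siblings = parent.children
        siblings.append(node)
    return roots


if __name__ == "__main__":
    outer = ConceptNode(0, 20, ["EXAMPLE:1"])
    inner = ConceptNode(6, 11, ["EXAMPLE:2", "EXAMPLE:2b"])
    crossing = ConceptNode(15, 30, ["EXAMPLE:3"])
    for root in build_tree([crossing, inner, outer]):
        print(root.ids, [child.ids for child in root.children])
```

Keeping intersected spans as siblings leaves the decision about which one to prefer to the caller, in the spirit of letting users choose the behaviour that suits their goals.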
Concept recognition is an essential task in biomedical information extraction, presenting several complex and unsolved challenges. The development of such solutions is typically performed in an ad-hoc manner or using general information extraction frameworks, which are not optimized for the biomedical domain and normally require the integration of complex external libraries and/or the development of custom tools. Thus, Neji fills the gap between general frameworks (e.g., UIMA and GATE) and more specialized tools (e.g., NER and normalization), streamlining and facilitating complex biomedical concept recognition.
On top of the built-in functionalities, developers and researchers can implement new processing modules or pipelines, or use the provided command-line interface tool to build their own solutions, applying the most appropriate techniques to identify names of various biomedical entities. Neji was designed with different development configurations and environments in mind: a) as the core framework to support all developed tasks; b) as an API to integrate in your favorite development framework; and c) as a concept recognizer, storing the results in an external resource, and then using your favorite framework for subsequent tasks.
Universidade de Aveiro, Anf. Ambiente, 14h
Pedro Lopes, “Service Composition in Biomedical Applications”
Universidade de Aveiro, DETI/IEETA
Dr. Kim Sneppen from the Niels Bohr Institute, Copenhagen-DK, will give the inaugural lecture of our Systems Biology seminar series, entitled Simplified Models of Biological Networks, on the 28th of September.
Universidade de Aveiro, Anf. Ambiente, 14h
The mRNA optimiser is a tool that redesigns a gene messenger RNA to optimise its secondary structure, without affecting the polypeptide sequence. The tool can either maximize or minimize the molecule minimum free energy (MFE), thus resulting in decreased or increased secondary structure strength.
The optimisation is achieved by using a heuristic to look for synonymous gene sequences and select the ones with the best secondary structure. Evaluations of the secondary structure are made using a correlated stem-loop prediction algorithm that examines the nucleotide sequence for simple stem-loops. This algorithm is fine-tuned to have its results highly correlated with the MFE evaluations of RNAfold.
Our results indicate that an average of over 40% increase in MFE can be obtained with this method. Also, since there is a tendency to reduce the GC percentage of nucleotide sequences when optimising, the developed tool includes an option to maintain the GC content of the wildtype gene.
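The search over synonymous sequences described above can be sketched in Python as a simple local search. This is an illustration under stated assumptions, not the tool's heuristic: it uses the standard genetic code, replaces one randomly chosen codon at a time by a synonymous one, and keeps the change only when a caller-supplied score improves. The score function is a placeholder for the stem-loop based evaluation. In the demo it is just the G+C count, chosen because it is trivial to compute, not because it predicts the MFE. All function names are hypothetical.

```python
import random
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated with the third base varying fastest.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def split_codons(rna):
    if len(rna) % 3 != 0:
        raise ValueError("coding region length must be a multiple of 3")
    return [rna[i:i + 3] for i in range(0, len(rna), 3)]


def translate(rna):
    """Polypeptide encoded by the coding region ('*' marks a stop codon)."""
    return "".join(CODON_TO_AA[c] for c in split_codons(rna))


def synonymous_search(rna, score, iterations=200, seed=0):
    """Hill-climb over synonymous codon choices, maximising score(rna)."""
    rng = random.Random(seed)
    codons = split_codons(rna)
    best = score(rna)
    for _ in range(iterations):
        pos = rng.randrange(len(codons))
        previous = codons[pos]
        codons[pos] = rng.choice(SYNONYMS[CODON_TO_AA[previous]])
        candidate = score("".join(codons))
        if candidate > best:
            best = candidate
        else:
            codons[pos] = previous  # reject: keep the earlier codon
    return "".join(codons)


def gc_count(rna):
    """Stand-in objective for the demo only."""
    return rna.count("G") + rna.count("C")


if __name__ == "__main__":
    original = "AUGGCUUUAGGAUAA"
    redesigned = synonymous_search(original, gc_count)
    print(translate(original), translate(redesigned))
```

Because every accepted change swaps a codon for a synonym, the translated polypeptide is identical before and after, which is the invariant the tool guarantees. To minimise instead of maximise, pass the negated score.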
P. Gaspar, G. Moura, M. A. S. Santos, and J. L. Oliveira mRNA secondary structure optimization using a correlated stem–loop prediction Nucleic Acids Research, Jan 2013, doi: 10.1093/nar/gks1473
Select your operating system:
Current version is 1.0.
The mRNA optimiser is a command line tool (a graphical interface will be available soon). To use it you need to open a terminal window, change to the directory where mRNAOptimiser is, and run it:
You may supply your mRNA sequence either by typing it into the terminal or by referring to an input file with the -f input_sequence option. The tool only changes the coding region of the mRNA, so you must indicate where the start codon begins (-b index, the index of the first nucleotide of the start codon) and where the stop codon ends (-e index, the index of the last nucleotide of the stop codon). The default coding zone is the entire sequence.
To redirect the output results to a file, use the -o output_file option. To choose whether the tool should maximize or minimize the MFE, use the -d type option (default is maximize). You may limit the algorithm in both time and number of iterations by using the options -t max_time and -i max_iterations. Also, the tool will use the standard genetic code by default, but you can select other genetic coding tables using the -c coding_table option.
To maintain the original mRNA percentage of guanine and cytosine (GC content) unaltered after optimisation, use the -g option. There is also a quiet mode, where nothing is output except for the resulting sequence, using the -q option.
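When the tool is driven from a script, the options described above can be assembled programmatically. The Python helper below is a hypothetical convenience wrapper, not part of the distribution. It only uses the flags named in the text (-f, -b, -e, -o, -t, -i, -c, -g, -q). The executable path, the file names and the index values are placeholders, and the -d option is left out because its accepted values are not listed here.

```python
def build_command(executable, input_file=None, begin=None, end=None,
                  output_file=None, max_time=None, max_iterations=None,
                  coding_table=None, keep_gc=False, quiet=False):
    """Return the argument list for one run of the optimiser."""
    args = [executable]
    valued_options = [
        ("-f", input_file),      # input sequence file
        ("-b", begin),           # first nucleotide of the start codon
        ("-e", end),             # last nucleotide of the stop codon
        ("-o", output_file),     # where to write the results
        ("-t", max_time),        # time limit
        ("-i", max_iterations),  # iteration limit
        ("-c", coding_table),    # alternative genetic coding table
    ]
    for flag, value in valued_options:
        if value is not None:
            args += [flag, str(value)]
    if keep_gc:
        args.append("-g")        # keep the original GC content
    if quiet:
        args.append("-q")        # print only the resulting sequence
    return args


if __name__ == "__main__":
    # Placeholder path and values, for illustration only.
    print(build_command("./mRNAOptimiser", input_file="gene.txt",
                        begin=10, end=400, keep_gc=True))
```

The resulting list can be handed to subprocess.run from the Python standard library once the real executable path is known.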
Any questions and suggestions are welcome
OralCard is an online bioinformatic tool that comprises results from manually curated articles reflecting the oral molecular ecosystem (OralPhysiOme), by merging the experimental information available from the oral proteome both of human (OralOme) and microbial origin (MicroOralOme). OralCard is a key resource for understanding the molecular foundations implicated in biology and disease mechanisms of the oral cavity.
OralCard integrates information about more than 3500 proteins and searching can be performed in three distinct views: (1) by protein names or respective UniProt codes, (2) by disease name, OMIM code or MeSH term, (3) and by organism.
Nuno Rosa, “From the Salivar Proteome to Oralome”
Universidade Católica Portuguesa, Viseu
International School on Semantic Web Applications and Technologies for the Life Sciences 2012
May 2nd – 5th, 2012
Located at the University of Aveiro,
More information online at http://www.swat4ls.org/schools/aveiro2012/
Helena Deus, “Linked Data and Semantic Web Technologies for improving discovery in the Life Sciences”
We live in a world of data. This is also true for the Life Sciences, where the introduction of omics technologies such as genome sequencing has led to the industrialization of data production beyond a craft-based cottage industry and into a deluge of biological information. Nevertheless, the apparently simple task of collecting and keeping pace with the latest information about a gene of interest is still thwarted by the need for biological researchers to become experts at database-surfing and literature mining.
Linked Data is a set of principles devised for creating a Web of Data where a new generation of Web applications can discover and link relevant pieces of information based on its properties rather than its location in a database. Linked data is also at the root of a movement towards building a knowledge continuum in the Life Sciences and by doing so, has the potential to be a foundation for a platform that will support 21st century Biology.
In this talk, I will present some of the scenarios where Linked Data has been successfully applied in accelerating scientific discovery and translation of Life Sciences knowledge into Health Care and what challenges are still to be addressed.
Helena Deus Bio at http://lenadeus.info
III WORKSHOP DE RED IBEROAMERICANA DE TECNOLOGÍAS CONVERGENTES NBIC EN SALUD (IBERO-NBIC) – CYTED Program
Hotel Moliceiro, Aveiro, Portugal
October 10-11, 2011
Day 1: Monday, 10
9h00 – 9h30: Opening and Welcome
José Luís Oliveira, DETI/IEETA, Universidade de Aveiro, Portugal
Acto de apertura del III Workshop Internacional Redes Ibero-NBIC y NanoRoadmap
Alejandro Pazos & Julián Dorado, Universidade da Coruña, España
9h30 – 11h15:
Vacunología inversa aplicada en malaria
Raúl Isea, IDEA, Fundación de Estudios Avanzados, Venezuela
Bioinformatics, research and applications
Sergio Guíñez Molinos, UCBSM, Universidad de Talca, Chile
Tecnologías NBIC y Nanotoxicidad: Gestión del conocimiento asociado al uso de nanopartículas en medicina
Diana de la Iglesia, GIB, Universidad Politécnica de Madrid, España
Tecnologías de la Información y el Conocimiento en Salud. Un Sistema Basado en Ontologías para el Apoyo a la Toma de Decisión en UCIs
Ana Freire, Universidade da Coruña, España
11h15 – 11h30: Coffee Break
11h30 – 13h00:
Integration of heterogeneous biomedical names taggers
David Campos, DETI/IEETA, Universidade de Aveiro, Portugal
Collecting and Enriching Human Variome Datasets
Pedro Lopes, DETI/IEETA, Universidade de Aveiro, Portugal
Doctoral Program in Nanosciences and Nanotechnology of the University of Aveiro
Tito Trindade, DQ, Universidade de Aveiro, Portugal
13h00 – 14h30: Lunch
14h30 – 16h00:
Integración de la información molecular en un Sistema de Informacion en Salud
Segunda etapa: estándares y control de calidad.
Carlos Otero, HIBA, Buenos Aires, Argentina
Connecting different levels of biological information. From atoms to people
Guillermo López, ISCIII – Instituto de Salud Carlos III, Madrid, España
Posibles aportes de una empresa de Educación Médica Continua a una red de investigación en Salud
Antonio López, EVIMED, Uruguay
16h00 – 16h30: Coffee Break
16h30 – 18h00:
Internal Meeting / Reunión Interna de la Red
Day 2: Tuesday, 11
Visit to Instituto Ibérico de Nanotecnologia (Braga)
The recognition of named entities is a crucial initial task of biomedical text mining. A number of NER solutions have been proposed in recent years, taking advantage of different resources and/or techniques. Currently, the best results are achieved by combining the output of different systems. However, little effort has been spent on such harmonisation solutions, and the existing ones are specific to a corpus and/or not knowledge-based.
Totum is an innovative harmonisation solution based on Conditional Random Fields, which were trained on several manually curated corpora. Thus, we avoid the single corpus dependency, supporting several biomedical domains and organisms. In the end, Totum harmonises gene/protein annotations provided by several heterogeneous NER solutions, following the gold standard requirements.
Considering a corpus that contains the test parts of the four corpora, the experiments show that Totum improves the F-measure of state-of-the-art tagging solutions by up to 10% in exact alignment and 22% in nested alignment. Finally, Totum achieves an F-measure of 70% (exact matching) and 82% (nested matching) against the same corpus.
COEUS main web server is down for maintenance. It will be online again on February 27th, 2013. Thank you for your patience.
Ipsa scientia potestas est. Knowledge itself is power.
Streamlined back-end framework for rapid semantic web application development.
Use Semantic Web & LinkedData technologies in all application layers.
Enable reasoning and inference over connected knowledge.
Access data through LinkedData interfaces and deliver a custom SPARQL endpoint.
Reduce development time. Get new applications up and running much faster using the latest rapid application development strategies.
Use COEUS advanced API to connect multiple nodes together and with any other software.
Create your own knowledge network using SPARQL Federation enabling data-sharing amongst a scalable number of peers
Launch your custom application ecosystem. Distribute your data to any platform or device.
Reach more users and create new semantic cloud-based software platforms.
The Human Variome relates to genomic mutations and their effects on particular phenotypes. This critical life sciences research field has grown greatly in recent years, mostly due to the appearance of projects such as the Human Variome Project or the European GEN2PHEN Project. Nonetheless, locus-specific mutation databases and included variants are far from being standardized and widely used in the research community workflow. With WAVe, we offer centralized and transparent access to these databases, combined with the integration of found variants in a single system that is enriched with the most relevant gene-related information in a user-friendly web-based workspace.
WAVe provides a comprehensive set of features that will improve biologists’ workflow when researching in the genomic variation field.
Searching for genes only requires that users start typing the gene HGNC-approved symbol in any of the available search boxes. This event will trigger the automatic suggestion system that will offer various solutions based on users’ input. Following one of the suggestions leads directly to the gene view interface. When a suggestion is not accepted and there is more than one match, WAVe will display the gene browse interface, containing only the results matching the provided query.
Querying for * lists all genes as well as the available LSDBs and variants for each gene. In this gene browse scenario, searches for a particular gene can be performed, in real time, by typing in the table search box. Clicking on one of the genes sends users to the gene view interface.
The gene view interface is the main WAVe workspace. The layout is divided into two main areas: the sidebar and the content zone. The sidebar displays minimal gene information at the top (gene HGNC symbol, name and locus) and the navigation tree, which is the key element of WAVe’s user interface, at the bottom. The navigation tree is organized in nodes, each referring to a distinct data type; each leaf links directly to a page containing information regarding a specific topic. Pages linked from each leaf appear in the content zone. This enables loading external applications without leaving WAVe’s interface and, thus, without losing focus on ongoing research.
Programmatic access to data is also available. The gene tree is available as an easily parsable feed. Feeds are obtained by appending the atom tag (or another format: rss, json) to the end of the gene view address. For instance, the BRCA2 Atom feed is available at http://bioinformatics.ua.pt/WAVe/gene/BRCA2/atom .
WAVe also provides an RSS API for variant access. With this, you have programmable access to all available variants for a given gene. For instance, BRCA2 variants (from multiple LSDBs) are at http://bioinformatics.ua.pt/wave/variant/BRCA2/atom. In addition to the variant description, WAVe points to the original LSDB containing the variant.
This makes WAVe the only platform capable of providing aggregated variant listings through both visual and programmable access.
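As a minimal sketch of the feed addressing described above, the following Python helpers assemble gene and variant feed addresses from a gene symbol. The function names are hypothetical; the two URL patterns are copied from the BRCA2 examples in the text, and accepting rss and json for the variant feed is an assumption, since only the atom form is shown for variants.

```python
# Illustrative helpers only; names are not part of WAVe.
BASE = "http://bioinformatics.ua.pt"
FORMATS = ("atom", "rss", "json")  # formats named in the text for the gene feed


def gene_feed_url(symbol: str, fmt: str = "atom") -> str:
    """Build the gene view feed address by appending the format tag."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported feed format: {fmt}")
    return f"{BASE}/WAVe/gene/{symbol}/{fmt}"


def variant_feed_url(symbol: str, fmt: str = "atom") -> str:
    """Build the variant feed address (only 'atom' is shown in the text)."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported feed format: {fmt}")
    return f"{BASE}/wave/variant/{symbol}/{fmt}"


if __name__ == "__main__":
    print(gene_feed_url("BRCA2"))
    print(variant_feed_url("BRCA2"))
```

The resulting addresses can then be passed to any feed parser or HTTP client.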
We highly appreciate any feedback you can provide regarding WAVe and the genomic variation field. To do this, you can simply send an e-mail to firstname.lastname@example.org. Thank you.
Daniel Sobral, “Ensembl Regulation”
Ensembl is a world reference for vertebrate genome annotation, providing high quality annotation for more than 50 species. Particularly challenging is the annotation of non-coding functional regions of the genome. Ensembl Regulation aims at making Ensembl a reference for the annotation of genomic features with a potential role in the transcriptional regulation of gene expression. Combining publicly available data from large projects like ENCODE and The Epigenomics Roadmap, we group overlapping areas of open chromatin and transcription factor binding to build a “best-guess” set of regulatory features, in a cell-aware manner. Finally, we also include histone-modification and polymerase data to generate cell-specific classifications for the regulatory regions. Taking advantage of the role of the EBI as part of the ENCODE data analysis group, we aim at bringing Ensembl to the forefront of the annotation of the regulatory genome.
Daniel Polónia, “An electronic market for teleradiology services”
José Paulo Lousado, “Pattern analysis on DNA primary structure”
Funding entity: FP7-ICT (STREP)
The overall objective of this project is the design, development and validation of a computerized system that exploits data from electronic healthcare records and biomedical databases for the early detection of adverse drug reactions.
The integration of heterogeneous data sources has been a fundamental problem in database research over the last two decades. The goal is to achieve better methods to combine data residing at different sources, under different schemas and with different formats, in order to provide the user with a unified view of the data. Although simple in principle, this is a very challenging task due to several constraints, and both the academic and the commercial communities have been working on it and proposing solutions that span a wide range of fields. However, the limitations found in most solutions reflect the difficulty of obtaining a simple but comprehensive schema able to accommodate the heterogeneity of the biological domain while maintaining an acceptable level of performance. GeNS is our proposal towards solving this issue.
The Genomic Name Server can be either downloaded and installed on a local computer or accessed through Web Services. Please keep in mind that GeNS currently requires over 10 GB of disk space and this figure is likely to increase in the near future. Therefore, if disk space is a serious restriction you should consider using the available Web Services. We are currently using Microsoft SQL Server 2008, but GeNS can be set up in any other DBMS.
The Web Services are now available here. Furthermore, a detailed description is also available here (Updated March 24). The Web Services API is in an early stage of development and, as such, users should bear in mind that certain problems may arise during its usage.
GeNS uses four distinct methods for gathering data from external databases: Web Services, web crawlers, database connectors and tabular file connectors. All of the recovered data is subsequently processed and synchronized to our database. Finally, the data can be accessed via Web Services or by downloading, installing and querying the data with SQL.
Currently, GeNS is importing data from four major databases: UniProt (SwissProt and TrEMBL), KEGG, EMBL-EBI and Entrez. Since these databases already incorporate data from third-party databases, we have over 460,000 unique genes, more than 100,000 biological relations and 140 distinct data types.
GeNS database was designed with simplicity and extensibility in mind; the following schema is a complete representation of the database.
The following files allow anyone to reproduce the obtained results regarding the cross-database low identifier coverage issue and the performance testing queries. You will need a working copy of GeNS in order to use these scripts.
GeneBrowser is a web-based tool that, for a given list of genes, combines data from several public databases with visualisation and analysis methods to help identify the most relevant and common biological characteristics. The functionalities provided include the following: a central point with the most relevant biological information for each inserted gene; a list of the most related papers in PubMed and gene expression studies in ArrayExpress; and an extended approach to functional analysis applied to Gene Ontology, homologies, gene chromosomal localisation and pathways.
Although GeneBrowser can be used to answer many different biological questions, a particular question set was used to tune its development:
We highly appreciate any feedback you can provide regarding GeneBrowser. email@example.com. Thank you.
J. Arrais, J. Fernandes, J. Pereira and J. L. Oliveira, Exploring and identifying common biological traits in a set of genes, BMC Bioinformatics 2010, 11:212
Dicoogle is an information retrieval system for medical images. It starts by indexing DICOM files and metadata, both locally and in distributed systems using a P2P communication framework. Upon this distributed index users can then search for exams or specific features using a free text interface.
QuExT (Query Expansion Tool) is a document indexing and retrieval application that obtains, from the MEDLINE database, a ranked list of publications that are most significant to a particular set of genes. Document retrieval and ranking are based on a concept-based methodology that broadens the resulting set of documents to include documents focusing on these gene-related concepts. Each gene in the input list is expanded to its various synonyms and to a network of biologically associated terms. Currently, the expansion is based on proteins, metabolic pathways and diseases (this last one only when the selected organism is Homo sapiens). The retrieved documents are ranked according to user-definable weights for each of these concept classes. By simply changing these weights, users can alter the order of the documents, allowing them to obtain for example, documents that are more focused on the metabolic pathways in which the initial genes are involved, rather than on the genes themselves.
QuExT receives as input a list of genes and a corresponding organism. The gene list can be typed into the input box or uploaded in a text file. Genes can be separated by commas or spaces. The organism to consider is selected from the drop-down menu. Figure 1 shows the query expansion procedure.
When the user submits the form, gene names or identifiers in the input are checked against a database and mapped to an internal identifier corresponding to the selected organism. Genes which are not found in the database are rejected from further analysis.
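The input handling described above can be sketched as follows. This is an illustration under stated assumptions, not the QuExT implementation: the function names and the lookup dictionary are hypothetical stand-ins for the organism-specific gene database, and the case-insensitive matching is an assumption.

```python
import re


def parse_gene_list(text: str) -> list[str]:
    """Split a user-supplied gene list on commas and/or whitespace."""
    return [tok for tok in re.split(r"[,\s]+", text.strip()) if tok]


def map_genes(tokens, lookup):
    """Map gene names to internal IDs; unknown names are rejected.

    `lookup` is a hypothetical dict {UPPERCASE_NAME: internal_id} for the
    selected organism.
    """
    accepted, rejected = {}, []
    for tok in tokens:
        internal_id = lookup.get(tok.upper())
        if internal_id is None:
            rejected.append(tok)
        else:
            accepted[tok] = internal_id
    return accepted, rejected


if __name__ == "__main__":
    lookup = {"GENEA": 101, "GENEB": 102}  # made-up identifiers
    tokens = parse_gene_list("geneA, geneB  unknownGene")
    print(map_genes(tokens, lookup))
```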
QuExT then creates an expanded query and searches a local index of the PubMed database for documents matching this query.
Query expansion is performed as follows: for each gene in the query, the algorithm obtains, from a term expansion table corresponding to the selected organism, all the alternative gene, protein, pathway and disease names corresponding to that gene’s internal ID. The full list of terms from all input genes is then accumulated in four separate query strings (one for each concept type). Each term obtained from expanding all genes is used to search the index.
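A minimal sketch of this expansion step is given below. The table layout, the function name and the use of a quoted OR-joined query string are assumptions made for illustration; the text only states that terms are accumulated into four query strings, one per concept type.

```python
CONCEPTS = ("gene", "protein", "pathway", "disease")


def expand_query(gene_ids, expansion_table):
    """Accumulate expansion terms of all input genes, one query per concept.

    `expansion_table` is a hypothetical dict:
        {internal_id: {"gene": [...], "protein": [...],
                       "pathway": [...], "disease": [...]}}
    """
    terms = {concept: [] for concept in CONCEPTS}
    for gene_id in gene_ids:
        entry = expansion_table.get(gene_id, {})
        for concept in CONCEPTS:
            for term in entry.get(concept, []):
                if term not in terms[concept]:  # avoid duplicate terms
                    terms[concept].append(term)
    # One query string per concept type (OR-joined phrases: an assumption).
    return {c: " OR ".join(f'"{t}"' for t in terms[c]) for c in CONCEPTS}


if __name__ == "__main__":
    table = {
        101: {"gene": ["GENEA", "GENEA-ALIAS"], "protein": ["PROTA"],
              "pathway": ["Pathway X"], "disease": ["Disease Y"]},
        102: {"gene": ["GENEB"], "pathway": ["Pathway X"]},
    }
    for concept, query in expand_query([101, 102], table).items():
        print(concept, "->", query)
```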
QuExT runs four index searches using the four query strings obtained in the query expansion stage (one for each concept type). For each search, the documents that match the query and the corresponding scores are obtained. Resulting documents and corresponding scores are kept on separate lists, one for each concept class.
Notice that while the term expansion takes into account the selected organism, to avoid going from a gene in one organism to a related term in a different organism, this is not true for document retrieval. Since the indexing does not distinguish between the different species referred to in the articles, a search for a gene name in H. sapiens may return results referring to the same gene but in mice, for example.
Finally, the results from the document retrieval stage are assembled and documents are re-ranked in terms of the defined weights for each concept. The final score for document i is obtained as a weighted sum of the four concept-based scores:
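The re-ranking step can be illustrated with a short sketch in which the final score of a document is the sum, over the four concept classes, of the user-defined weight times the score the document obtained in that class. Treating a document that is missing from a class as scoring zero there is an assumption, and the names below are hypothetical.

```python
def rerank(scores_by_concept, weights):
    """Combine per-concept scores into one ranked list.

    scores_by_concept: {concept: {doc_id: score}}
    weights:           {concept: weight}
    A document absent from a concept's result list contributes 0 for it.
    """
    final = {}
    for concept, doc_scores in scores_by_concept.items():
        weight = weights.get(concept, 0.0)
        for doc_id, score in doc_scores.items():
            final[doc_id] = final.get(doc_id, 0.0) + weight * score
    # Highest combined score first.
    return sorted(final.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    scores = {
        "gene": {"doc1": 2.0, "doc2": 1.0},
        "pathway": {"doc2": 4.0},
    }
    print(rerank(scores, {"gene": 1.0, "pathway": 0.5}))
    print(rerank(scores, {"gene": 1.0, "pathway": 0.0}))
```

Changing only the weights reorders the same retrieved documents, which is the behaviour described for the user-definable concept weights.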
S. Matos, J. P. Arrais, J. Maia-Rodrigues, J. L. Oliveira, “Concept-based query expansion for retrieving gene related publications from MEDLINE”, BMC Bioinformatics, Apr 28; 11:212, 2010.
NeoScreen is a bioinformatics application that supports diagnosis tasks in newborn screening programs. The application imports MS/MS raw data, and organizes and maintains all the information over time in a database, providing a set of patterns that allow the detection of abnormalities in the blood samples. It has been used since 2005 to support the Portuguese Newborn Screening Program (http://www.diagnosticoprecoce.org/).
The introduction of Tandem Mass Spectrometry (MS/MS) in neonatal screening laboratories has opened the way to innovative newborn screening analysis. With this technology, the number of metabolic disorders that can be detected from dried blood-spot specimens increases significantly. However, the amount of information obtained with this technique and the pressure for quick and accurate diagnostics raise serious difficulties in the daily data analysis. To face this challenge we developed a software system, NeoScreen, which simplifies and speeds up newborn screening diagnostics.
In this view, the individuals are separated into several diagnostic categories, such as “very suspicious”, “suspicious”, “not suspicious”, etc. Some of these categories represent individuals with markers outside the established limits that are not associated with any known disease. The right-side frame displays the relevant information that was extracted and processed by the software for each individual, such as plate information, marker concentrations and suspected diseases.
MIND is a repository of microarray experiments that handles storage, management and analysis of microarray data. It is supported by an infrastructure prepared to integrate dynamically further functionalities (Quality Control assurance, data processing, data mining, visualization, reports, etc.).
The development of microarray technology has been phenomenal during the past years, and it is becoming a daily tool in many genomics research laboratories. However, the multi-step and data-intensive nature of this technology has created an unprecedented computational challenge. In fact, the full power of microarray technology can only be achieved if researchers are able to efficiently store, analyse and share their results.
A LIMS (Laboratory Information Management System) is a database repository for managing all laboratory data.
ANACONDA is a software package specially developed for the study of genes’ primary structure. It uses gene sequences downloaded from public databases, in formats such as FASTA and GenBank, and applies a set of statistical and visualization methods in different ways to reveal information about codon context, codon usage, nucleotide repeats within open reading frames (ORFeome) and others.
Genome sequencing is opening unprecedented ways of understanding how gene primary structure is organized. Two of the most studied open reading frame characteristics are codon usage and codon context.
Traditional methods used for codon usage and context analysis do not provide user-friendly tools to carry out detailed gene primary structure analysis at a genomic scale.
Codon usage tables, using absolute metrics, are available in public databases for any sequenced gene or genome, and freeware software for multivariate analysis (correspondence analysis) of codon and amino acid usage is also readily available; however, sophisticated statistical and data visualization tools are clearly lacking.
We propose the usage of several statistical methods – contingency table analysis, residual analysis, multivariate analysis (cluster analysis) – to analyze the codon bias under various aspects (degree of association, contexts and clustering).
A cluster analysis tool also allows similarities between two vectors of the contingency table to be calculated. This technique is used to group rows and columns (codons) of the correlation matrix, highlighting global patterns in the genes.
The statistical tools that are incorporated in the system, for data clustering, residual analysis and histogram plotting of calculated indexes, allow new conclusions to be reached on gene primary structure features at a genomic scale. We expect that the results obtained will permit identifying some general rules that govern codon context and codon usage in any genome. Additionally, the identification of genes containing expanded codons that arise as a consequence of erroneous DNA replication events will permit uncovering new genes associated with human disease.
In order to detect the impact of codon context bias (as well as the presence of rare codons) on coding sequences, ANACONDA has additional tools for sequence mapping. The sequence layout includes written information about the ORF and the sequence itself, in which the codons are coloured with the same residual colour scale as the ORFeome map.
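To make the contingency table and residual analysis more concrete, the sketch below counts adjacent codon pairs over a set of ORFs and computes a plain Pearson residual, (observed - expected) / sqrt(expected), for each pair. This is only an illustration: the exact residual statistic used by ANACONDA is not specified here, and the function names and toy sequences are hypothetical.

```python
from collections import Counter
from math import sqrt


def codons(orf: str) -> list[str]:
    """Split an ORF into consecutive codons, ignoring a trailing partial one."""
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def pair_counts(orfs) -> Counter:
    """Contingency table of adjacent codon pairs (first codon, second codon)."""
    counts = Counter()
    for orf in orfs:
        cs = codons(orf)
        counts.update(zip(cs, cs[1:]))
    return counts


def pearson_residuals(counts: Counter) -> dict:
    """Residual of each table cell under independence of rows and columns."""
    total = sum(counts.values())
    row, col = Counter(), Counter()
    for (first, second), n in counts.items():
        row[first] += n
        col[second] += n
    residuals = {}
    for first in row:
        for second in col:
            expected = row[first] * col[second] / total
            observed = counts.get((first, second), 0)
            residuals[(first, second)] = (observed - expected) / sqrt(expected)
    return residuals


if __name__ == "__main__":
    table = pair_counts(["ATGAAAGGG", "ATGAAATTT"])  # toy sequences
    for pair, value in sorted(pearson_residuals(table).items()):
        print(pair, round(value, 3))
```

Positive residuals mark codon pairs seen more often than independence would predict, negative ones pairs that are avoided; a colour scale over these values is one way to obtain a map of the kind described.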
ANACONDA allows the user to work with more than one ORFeome at a time. This creates large data sets that are difficult to deal with, in particular when multiple comparisons are being performed.
Considering that a vast number of ORFeomes can be analyzed simultaneously by ANACONDA, we have included extra tools to allow comparative studies.
Anaconda 2 is now available for download. It is freely available for fundamental research only.
The digitalization of medical imaging and the implementation of PACS (Picture Archiving and Communication Systems) increase practitioners’ satisfaction through improved, faster and ubiquitous access to image data. Besides, they reduce the logistic costs associated with the storage and management of image data and increase intra- and inter-institutional data portability. Echocardiography is a rather demanding medical imaging modality when regarded as a digital source of visual information. The data rate and volume associated with a typical study pose several problems: studies are hard to keep “online” (in centralized servers) and difficult to access (in real time) outside the institutional broadband network infrastructure. For example, the size of an uncompressed echocardiography study can typically vary between 100 and 500 MB.
The innovation of our approach is the implementation of a DICOM private transfer syntax designed to support any video encoder installed on the operating system. This structure provides great flexibility concerning the selection of an encoder that best suits the specifics of a particular imaging modality or working scenario. For ultrasound studies we are using the highly efficient MPEG4 codec, which takes full advantage of object texture, shape coding and inter-frame redundancy. More than 40,000 studies have been performed so far. For example, a typical Doppler color run (RGB) with an optimized acquisition time (15-30 frames) and a sampling matrix of 480*512 rarely exceeds 200-300 kB. Typical compression ratios can go from 65 for a single cardiac cycle sequence to 100 in multi-cycle sequences. With these averaged figures, even for a heavily loaded echolab, it is possible to have all historic procedures online or to distribute them with reduced transfer time over the network, which is a very critical issue when dealing with costly or low-bandwidth connections. The solution is currently installed in one public Central Hospital (CHVNG) and one private cardiac imaging laboratory. Because the solution front-end is fully Web-based, the clinical specialists are using the platform to provide decision support remotely, accessing it over the Internet in a secure way (i.e. over SSL). Moreover, the solution is changing the work methods. The process workflow is fully digital, and reviewing and reporting procedures can be done at the physician’s home (i.e. telework).
Two studies were carried out to assess the DICOM cardiovascular ultrasound image quality. In a simultaneous and blind display of the original against the compressed cine-loops, the compressed sequence was selected as the best image in 37% of the trials. This suggests that other factors related to viewing conditions are more likely to influence observer performance than the image compression itself.
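As a rough back-of-the-envelope check on these figures, the sketch below computes the uncompressed size of a color run and the resulting compression ratio. It assumes 8 bits per RGB channel and 1 kB = 1000 bytes, neither of which is stated in the text, and the function names are hypothetical.

```python
def raw_run_bytes(frames: int, rows: int = 480, cols: int = 512,
                  channels: int = 3, bytes_per_sample: int = 1) -> int:
    """Uncompressed size of a run: frames x rows x cols x channels x bytes."""
    return frames * rows * cols * channels * bytes_per_sample


def compression_ratio(frames: int, compressed_bytes: int) -> float:
    """Ratio between the uncompressed and the compressed run size."""
    return raw_run_bytes(frames) / compressed_bytes


if __name__ == "__main__":
    # The two ends of the ranges quoted in the text (15-30 frames, 200-300 kB).
    for frames, compressed in ((15, 200_000), (30, 300_000)):
        print(frames, "frames:", raw_run_bytes(frames), "bytes raw, ratio",
              round(compression_ratio(frames, compressed), 1))
```

Under these assumptions a 15-frame run occupies about 11 MB uncompressed and a 30-frame run about 22 MB, giving ratios of roughly 55 and 74, the same order of magnitude as the typical ratios quoted above.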
Funding entity: FCT (PTDC/BIA-BCM/72251/2006)
Funding entity: FCT (PTDC/BIA-BCM/64745/2006)
Funding entity: FCT (PTDC/MAT/72974/2006)
DNA Microarray technology is one of the most promising new technologies for global gene expression analysis. This technology is sophisticated, very expensive, highly interdisciplinary and produces vast amounts of data whose management and analysis pose significant challenges. This project aims to study new bi-clustering approaches that can help to obtain relevant information from gene expression microarrays.
Funding entity: HFSP Research Grant
The very few quantitative mRNA mistranslation studies carried out to date indicate that the average decoding error ranges from 10^-4 to 10^-5 errors per codon decoded. However, no systematic study has yet been carried out to rank mRNA sequences according to decoding error and no methodology has yet been developed to identify genes that are prone to decoding error.
In this project, software tools for data visualization and mathematical methodologies for identification of general rules governing RNA translation, and tools for mapping mRNA regions of high decoding error and for identifying putative gene expression regulatory sequences present in mRNAs, will be developed.
Funding entity: IST FP6 (IST2002-507585) – NoE (Network of Excellence)
There is a great potential for synergy between medical informatics and bioinformatics with a view on continuity and individualisation of healthcare, so that the benefits of the human genome sequence can reach the population. A collaborative effort between those two disciplines is needed to bridge the current gap between them. Biomedical Informatics (BMI) is an emerging discipline that aims at bringing these two worlds together to foster the development of novel diagnostic and therapeutic methodologies and strategies.
The INFOBIOMED network aims at setting up a durable structure for the described collaborative approach at a European level, mobilising the critical mass and the resources necessary to support the consolidation of BMI as a crucial scientific discipline for future healthcare.
Funding entity: IST FP5 (IST2001-39013)
UA/IEETA was the Project Coordinator
One goal currently challenging bio- and clinical informatics is to develop robust computational methods and tools to model, store, retrieve and analyse information at multiple levels of complexity, i.e., from molecule to organism. For example, the unification of heterogeneous databases under one virtual system is an important step towards developing such robust computational models. The latter is the objective of the INFOGENMED project, which aims at building a virtual laboratory for accessing and integrating genetic and medical information for health applications. Once built, the system allows practitioners, biologists, chemists and other experts to navigate through local and remote biomedical databases.
INFOGENMED (http://www.infogenmed.net) started in September 2002, and the functionalities already built into the system allow for: (1) defining clinical pathways to guide the user in the navigation of multiple sources over the Internet; (2) identifying and characterizing the most relevant databases to support the molecular medicine practice for selected rare genetic diseases; (3) designing the integration methods, based on virtual databases, mediators and semantic vocabulary servers.
Talk from Florentino Fernández Riverola, Dpto. de Informática – Universidade de Vigo
Current research lines and projects of the “Next Generation Information Systems” group, from University of Vigo, in Orense
The 14th Meeting of the Sociedade Portuguesa de Genética Humana (Portuguese Society of Human Genetics) will be held on 18, 19 and 20 November 2010 in Coimbra, at the Auditório da Fundação Bissaya Barreto (Bencanta).
More information is available on the SPGH page.
The 10th IEEE International Conference on Information Technology and Applications in Biomedicine will be held in Corfu, Greece, November 2-5, 2010, at Aquis Corfu Holiday Palace.
The Xth Spanish Symposium on Bioinformatics (JBI2010) takes place on October 27-29, 2010, in Torremolinos-Málaga, Spain. It is co-organised by the National Institute of Bioinformatics-Spain and the Portuguese Bioinformatics Network and hosted by the University of Malaga (Spain).
Place: Aveiro, Portugal
Date: 28 Jan- 2 Feb 2010
Place: Aveiro, Portugal
Date: December 8-10, 2006
Place: Aveiro, Portugal
Date: November 10-11, 2005
Funding entity: POCTI-32030/2001
Biologists have been wondering for many years how organisms evolved highly accurate information maintenance, transfer and decoding machineries. In particular, it remains unclear how the astonishing translational decoding rate of 20 codons per second is achieved with an average error of 10^-4 to 10^-5 per codon decoded, and how the ribosome maintains the reading frame. The tools to answer these questions are not yet available but the raw DNA sequencing data is. To shed new light on this important question, we have developed a software package that simulates ribosome scanning and reading during mRNA translation. The software screens fully or partially sequenced genomes and determines the arrangement of any particular codon in relation to the others by simultaneously fixing P-site codons and “memorizing” E and A-site codons during each translocation cycle. In doing so, it builds a genome-wide codon context map that allows for identification of potential error-prone mRNA sequences and gene expression regulatory points.
In this project, the various tools already developed will be integrated into a single software package to allow for automated search, downloading and editing of raw DNA sequence data. Software tools for data display and new mathematical methodologies for identification of general rules governing mRNA translation will be developed. New tools for mapping mRNA regions of high decoding error and putative gene expression regulatory sequences present in the mRNAs will also be developed. Finally, a database and an Internet Home Page will be built for making the data available to the scientific community. These in silico studies will be complemented with in vivo experiments. For this, a multidisciplinary team including two computing engineers, two mathematicians, one physicist, one biochemist and one molecular biologist has been assembled. To our knowledge this is the first Portuguese multidisciplinary team set up for functional genomics and the only one actively engaged in the development of software tools and mathematical models for genome analysis. It is expected that this project will provide important new insight into the role of the translational machinery in genome evolution.
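The scanning idea can be sketched as a sliding window of three consecutive codons, in which the middle codon plays the role of the P-site codon and its neighbours those of the E-site and A-site codons. The code below is a simplified illustration and not the software described above; the function names and the toy sequence are hypothetical.

```python
from collections import Counter, defaultdict


def split_codons(orf: str) -> list[str]:
    """Split an ORF into consecutive codons, ignoring a trailing partial one."""
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def codon_context_map(orfs):
    """For each P-site codon, count the (E-site, A-site) codon pairs around it."""
    context = defaultdict(Counter)
    for orf in orfs:
        cs = split_codons(orf)
        # One window per translocation step: (E, P, A) = three adjacent codons.
        for e_site, p_site, a_site in zip(cs, cs[1:], cs[2:]):
            context[p_site][(e_site, a_site)] += 1
    return context


if __name__ == "__main__":
    ctx = codon_context_map(["ATGAAAGGGTAA"])  # toy ORF
    for p_site, neighbours in ctx.items():
        print(p_site, dict(neighbours))
```

Aggregating these counts over all ORFs of a genome yields a table from which over- or under-represented contexts can be identified.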
Funding entity: POCTI-32942/99
Candida albicans is an important human pathogen which exists as a commensal in at least 50% of the human population. It accounts for more than 60% of all fungal infections and is now the fourth most common form of septicaemia in Western hospitals, with an associated morbidity between 30 and 50%. It is also a major cause for concern in HIV-infected populations, where 84% of the patients develop oropharyngeal C. albicans colonisation and 55% develop clinical thrush. C. albicans pathogenesis is dependent upon a wide range of virulence factors, namely a myriad of morphogenesis-associated factors, which represents a major challenge to the elucidation of C. albicans pathogenesis at the molecular level through classic molecular and biochemical methodologies. The diploid nature of C. albicans, its alternative genetic code and its recalcitrance to genetic analysis add extra difficulties to its study and to the development of new antifungals. However, the advent of new genetic and molecular technologies which allow for genome-wide analysis promises to alter the present situation.
This project aims at integrating classical genetics and biochemical approaches with newly developed proteomics and bioinformatics methodologies to uncover new virulence factors associated with morphogenesis.
Software tools are being developed for the management of biological data extracted from protein 2D-maps, for helping to plan and follow up experimental protocols, and for data storage. Additionally, mathematical algorithms are also being developed for creating theoretical protein 2D-maps for comparative proteomics studies.
Funding entity: FCT PTDC/EIA-CCO/100541/2008
The objective of this project is to develop a query expansion and document ranking method specially aimed at obtaining, from the MEDLINE database, a ranked list of publications that are most significant to a set of genes.
Funding entity: FCT PTDC/EIA-EIA/104428/2008
The overall goal is to instantiate a new network connectivity concept for medical imaging data and services at inter-institutional level. This will turn large volumes of clinical information and analytical tools, currently “locked” in clinical units, into shared repositories and high-quality collaborative environments for medical applications, education and research.
IEETA explores the potential of ICT in the early detection of adverse drug reactions
Funding entity: FP7-Health (IP)
The GEN2PHEN project has the overall ambition of unifying human and model organism genetic variation databases, and doing this in such a way that the resulting holistic view of G2P data can be blended with all other biomedical database domains via one or more central genome browsers.
XS is a skilled FASTQ read simulation tool, flexible, portable (does not need a reference sequence) and tunable in terms of sequence complexity. XS handles Ion Torrent, Roche-454, Illumina and ABI-SOLiD simulation sequencing types. It has several running modes, depending on the time and memory available, and is aimed at testing computing infrastructures, namely cloud computing of large-scale projects, and testing FASTQ compression algorithms. Moreover, XS offers the possibility of simulating the three main FASTQ components individually (headers, DNA sequences and quality-scores). XS was designed and implemented at IEETA, a research unit of the University of Aveiro, and is available for non-commercial use. For other uses, please send an email to firstname.lastname@example.org.
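For readers unfamiliar with the format, the sketch below generates a toy FASTQ record with the three components mentioned above (header, DNA sequence and quality scores) produced separately. It is not XS and does not reproduce any of its sequencing models; the header naming, the uniform random base and quality choices and the Phred+33 encoding are assumptions made for illustration.

```python
import random


def simulate_header(index: int) -> str:
    """Header line; the naming scheme is made up for this sketch."""
    return f"@sim_read_{index}"


def simulate_sequence(length: int, rng: random.Random,
                      alphabet: str = "ACGT") -> str:
    """Uniformly random bases (real simulators use richer models)."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def simulate_qualities(length: int, rng: random.Random,
                       max_quality: int = 40) -> str:
    """Uniformly random quality scores, encoded as Phred+33 characters."""
    return "".join(chr(33 + rng.randint(0, max_quality)) for _ in range(length))


def simulate_record(index: int, length: int, rng: random.Random) -> str:
    """Assemble the four lines of a FASTQ record."""
    return "\n".join([
        simulate_header(index),
        simulate_sequence(length, rng),
        "+",
        simulate_qualities(length, rng),
    ])


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed for reproducibility
    print(simulate_record(1, 30, rng))
```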
Pratas, D., Pinho, A. J., & Rodrigues, J. M. R. (2014). XS: a FASTQ read simulator. BMC Research Notes, 7(1), 40.
tar -vzxf XS.tar.gz
HighFCM is a compression algorithm that relies on a pre-analysis of the data before compression, with the aim of identifying regions of low complexity. This strategy makes it possible to use deeper context models, supported by hash tables, without requiring huge amounts of memory. As an example, context depths as large as 32 are attainable for alphabets of four symbols, as is the case of genomic sequences. These deeper context models show very high compression capabilities in very repetitive genomic sequences, yielding improvements over previous algorithms. Furthermore, this method is universal, in the sense that it can be used on any type of textual data (such as quality scores). HighFCM was designed and implemented at IEETA, a research unit of the University of Aveiro, and is available for non-commercial use.
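The role of a hash-table-backed context model can be illustrated with a small adaptive finite-context model that estimates how many bits per symbol an ideal coder would need. This is a generic sketch and not HighFCM: it omits the pre-analysis step and the actual arithmetic coder, and the estimator parameter alpha, the cost charged for the first symbols and the function name are assumptions.

```python
from collections import defaultdict
from math import log2


def fcm_bits_per_symbol(seq: str, order: int, alpha: float = 1.0,
                        alphabet: str = "ACGT") -> float:
    """Average ideal code length under an adaptive order-`order` context model.

    Counts are kept in a dict (hash table) keyed by the preceding `order`
    symbols, so only contexts that actually occur use memory.
    """
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 0))
    # Symbols without a full context are charged a uniform cost.
    bits = min(order, len(seq)) * log2(len(alphabet))
    for i in range(order, len(seq)):
        context, symbol = seq[i - order:i], seq[i]
        table = counts[context]
        prob = (table[symbol] + alpha) / (sum(table.values()) + alpha * len(alphabet))
        bits += -log2(prob)
        table[symbol] += 1  # adaptive update after coding the symbol
    return bits / len(seq)


if __name__ == "__main__":
    repetitive = "ACGT" * 50
    for k in (0, 2, 4):
        print("order", k, "->", round(fcm_bits_per_symbol(repetitive, k), 3))
```

On a highly repetitive toy sequence, the deeper orders need far fewer bits per symbol than order 0, which is the effect that deep context models exploit.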
Diogo Pratas and Armando J. Pinho. “Exploring deep Markov models in genomic data compression using sequence pre-analysis”. Proc. of the European Signal Processing Conference, EUSIPCO 2014, Lisboa, Portugal, September 2014.
Interactome for the Human oral cavity
From birth, humans are subject to the colonization and invasion attempts of numerous microorganisms. Although in normal situations contact with microbes can support the shaping and development of our immune system, specific situations, such as stress or an unhealthy diet, can render us vulnerable to opportunistic pathogens.
Since the oral cavity is particularly exposed to the environment, it is an anatomic region prone to microbial invasion. Additionally, one of the requirements for bacterial colonization and cellular invasion is the establishment of protein-protein interactions (PPIs) with the host. With this in mind, we aim to develop a computational method for prediction of the oral human-microbial interactome.
Revealing the human-microbial interactome will allow further understanding of the mechanisms behind the onset of oral diseases. Additionally, this knowledge may give insight on key proteins involved in oral infections, which can be used for either diagnosis, as molecular biomarkers, or for treatment, as drug-targets.
GeCo is a method and tool designed for the compression and analysis of genomic data. As a compression tool, GeCo is able to provide additional compression gains over several top specific tools at different levels of redundancy. As an analysis tool, GeCo is able to determine absolute measures, namely for many distance computations, and local measures, such as the information content of each element, providing a way to quantify and locate specific genomic events. GeCo supports individual compression and referential compression (conditional or conditional exclusive). The tool is memory adjustable, using hash-caches for the deepest context models, which makes it possible to run it on modest computers.
Pinho, A.J.; Pratas, D.; Ferreira, P.J.S.G., “Bacteria DNA sequence compression using a mixture of finite-context models,” Statistical Signal Processing Workshop (SSP), 2011 IEEE, pp. 125-128, 28-30 June 2011.
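The notion of a local measure can be illustrated with a sketch in which a context model is built from a reference sequence only and then used to estimate the information content, in bits, of each element of a target sequence. This loosely mirrors the idea of referential analysis, but it is not GeCo: a single model replaces its mixture of models, and the function names and the estimator parameter alpha are assumptions.

```python
from collections import defaultdict
from math import log2


def train(reference: str, order: int, alphabet: str = "ACGT"):
    """Count, for each context of `order` symbols, which symbol follows it."""
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 0))
    for i in range(order, len(reference)):
        counts[reference[i - order:i]][reference[i]] += 1
    return counts


def information_profile(target: str, counts, order: int,
                        alpha: float = 1.0, alphabet: str = "ACGT"):
    """Bits needed for each target symbol, using reference statistics only."""
    profile = []
    for i in range(order, len(target)):
        table = counts.get(target[i - order:i])
        if table is None:  # context never seen in the reference
            table = dict.fromkeys(alphabet, 0)
        prob = (table[target[i]] + alpha) / (sum(table.values()) + alpha * len(alphabet))
        profile.append(-log2(prob))
    return profile


if __name__ == "__main__":
    model = train("ACGT" * 20, order=3)
    profile = information_profile("ACGTACGTTTTTACGT", model, order=3)
    print([round(bits, 2) for bits in profile])
```

Positions where the target departs from what the reference predicts stand out as peaks in the profile, which is how such a measure can help locate specific events along a sequence.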