Download

1. Tool

Target: Train your models with your data using the automated tool.
Contains: Tool and library.
  Download Gimli 1.0.2

2. Java Library

Target: Train your models with your data in your Java project.
Contains: Library.
  Download Gimli 1.0.2 Library

Apache Maven:

Use Apache Maven in your project to streamline development, building and deployment. The simplest way to use Gimli in your Maven project is to add both the repository and artifact dependencies to your POM file.

Repository dependency:

    <repository>
        <id>bioinformatics-all</id>
        <name>Public Bioinformatics Repository</name>
        <url>http://bioinformatics.ua.pt/maven/content/groups/public</url>
    </repository>
    

Artifact dependency:

    <dependency>
        <groupId>pt.ua.tm</groupId>
        <artifactId>gimli</artifactId>
        <version>1.0.2</version>
    </dependency>
    

3. Source Code

Target: Change, use and distribute.
Contains: Source code, tool, models and corpora.
  Download Gimli 1.0.2 Project

Git:

You can get the most recent version of Gimli from our Git repository, hosted on GitHub:   Go to Gimli Project on Github
After installing Git on your machine, you can get the latest version of Gimli by running the following command:
git clone git://github.com/davidcampos/gimli.git

4. Resources

You can download models, corpora and external tools (GDep) independently: Download Models   Download Corpora   Download Tools

License

Gimli is Free!
 
Gimli is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Caution!
  1. You are free to copy, distribute, change and transmit Gimli.
  2. Gimli may not be used for commercial purposes.
  3. Gimli is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Install

Requirements

Installation

The installation process is quite simple:
  1. Extract the downloaded zip file
  2. Open a terminal
  3. Go to the Gimli root folder
  4. Give execution permissions to the Gimli executable: chmod u+x gimli.sh

Build

If you downloaded the source code of Gimli, you need Apache Maven to build the project. We provide two solutions to build Gimli from source code.

Command Line

  1. Download Apache Maven 3
  2. Install Apache Maven 3 following the installation instructions
  3. Open a terminal
  4. Check if Apache Maven is correctly installed by running the command: mvn -v
    • If it is correctly installed, you should get the version of Apache Maven. Otherwise, there was a problem installing Maven.
  5. Go to the root folder of Gimli
  6. Run the command: mvn package
    • The first time you run this command, it will take a while, since it downloads the project dependencies.
  7. The generated jar files will be placed in the target folder.

Netbeans IDE

  1. Download Netbeans IDE
  2. Install Netbeans IDE by following the installation instructions
  3. Open Netbeans IDE
  4. Open Gimli project using Netbeans
  5. Build the project using Netbeans
    • The first time you build the project, it will take a while, since it downloads the project dependencies.
  6. The generated jar files will be placed in the target folder.

Use

The usage of Gimli is quite simple and interactive.

You should use the -h or --help option to learn the arguments of each feature provided by Gimli. For instance, type: ./gimli.sh model -h

Performance

We present detailed results achieved by Gimli on two corpora, which were used in two different challenges:

BC2

            Protein
Precision   90.24
Recall      84.99
F-measure   87.54

JNLPBA

            Protein   DNA     RNA     Cell Type  Cell Line  Overall
Precision   72.52     74.95   70.34   82.08      63.91      73.96
Recall      78.33     65.44   70.34   63.20      58.80      72.17
F-measure   75.31     69.87   70.34   71.41      61.25      73.05

Workflow

The usage workflow of Gimli is divided in 3 simple steps, which you should follow to use Gimli and its features: convert the input data, train and test CRF models, and annotate the corpus.

Data format

Gimli presently supports two input and output data formats, and additional formats are currently being tested.
If you have interest in a specific data format, please contact us! It is easy to extend Gimli to support additional data formats.

BioCreative II Gene Mention (BC2)

For an annotated corpus, this format requires two files: Sentences, and Annotations.
For an unannotated corpus, only the Sentences file is required.

The sentences file should contain one sentence per line, which includes the unique identifier and respective sentence separated by a white space. The unique identifier should not contain white spaces.

The annotations file should contain one annotation per line, in the following format: SENTENCE_ID|FIRST_CHAR LAST_CHAR|TEXT
The character offsets used for FIRST_CHAR and LAST_CHAR must be counted discarding white spaces.
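To make this counting rule concrete, the following illustrative Java helper (not part of the Gimli API; the class and method names are hypothetical) computes BC2-style offsets for a mention by counting only non-whitespace characters:

```java
// Illustrative helper (not part of the Gimli API): computes BC2-style
// FIRST_CHAR and LAST_CHAR offsets of a mention, counting characters
// with all white space discarded.
public class BC2Offsets {

    // Returns {firstChar, lastChar} for the first occurrence of the
    // mention inside the sentence, counting only non-whitespace characters.
    static int[] offsets(String sentence, String mention) {
        int start = sentence.indexOf(mention);
        if (start < 0) {
            throw new IllegalArgumentException("Mention not found: " + mention);
        }
        int first = countNonWhitespace(sentence, 0, start);
        int last = first + countNonWhitespace(sentence, start, start + mention.length()) - 1;
        return new int[]{first, last};
    }

    // Counts non-whitespace characters of s in the range [from, to).
    private static int countNonWhitespace(String s, int from, int to) {
        int n = 0;
        for (int i = from; i < to; i++) {
            if (!Character.isWhitespace(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String sentence = "BRCA1 and BRCA2 are human genes.";
        int[] o = offsets(sentence, "BRCA2");
        // "BRCA1" occupies positions 0-4 and "and" 5-7, so "BRCA2" spans 8-12
        System.out.println(o[0] + " " + o[1]); // prints "8 12"
    }
}
```

For instance, in "BRCA1 and BRCA2 are human genes.", the mention BRCA2 spans non-whitespace positions 8 to 12, so its annotation line would read SENT_ID|8 12|BRCA2 (with SENT_ID an illustrative identifier).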

For more information, please visit the challenge website:   Go to BioCreative

NLPBA Shared Task (JNLPBA)

This corpus should be provided using only one file, which already contains the abstracts, sentences, tokens and annotations.

The abstract must be identified by the respective MEDLINE identifier, using the format: ###MEDLINE:ID.

Each token is provided on its own line, which contains the token and the respective label separated by a tab (\t). The labels should follow the BIO encoding format: "B" marks the first token of an annotation, "I" marks the following tokens of the same annotation, and "O" marks tokens outside any annotation.

Since several entity types can be used in this format, each "B" or "I" label should carry a suffix with the semantic type: -protein, -DNA, -RNA, -cell_type and -cell_line.

Each sentence is a block of tokens. Different sentences should be separated by an empty line.
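Putting these conventions together, a small illustrative fragment of a JNLPBA-formatted file (with a hypothetical MEDLINE identifier and tab-separated columns) could look like this:

```text
###MEDLINE:00000000
IL-2	B-protein
gene	I-protein
expression	O
.	O
```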

For more information, please visit the challenge website:   Go to NLPBA

Convert

The goal of this module is to convert input data into a format that can be used by Gimli. During this process, Gimli uses GDep for tokenisation and linguistic feature extraction.
Caution!
  1. The output is provided in gzip format.
  2. The GDep file must be provided in gzip format.
  3. The GDep argument serves two purposes:
    • If the file does not exist, Gimli parses the corpus using GDep and stores the result in the provided file.
    • If the file exists, Gimli uses the parsing result.

User

BC2

To convert a corpus in the BC2 format, you should use the command ./gimli.sh convert BC2. It requires the following arguments:

If the corpus is provided with sentences and annotations (train corpus), the conversion should be performed as follows:

    ./gimli.sh convert BC2 -c sentences.txt -a annotations.txt -g gdep.gz -o corpus.gz
    

However, if you do not have annotations (test corpus), annotations should not be provided:

    ./gimli.sh convert BC2 -c sentences.txt -g gdep.gz -o corpus.gz
    

JNLPBA

To convert a corpus in the JNLPBA format, you should use the command ./gimli.sh convert JNLPBA. It requires the following arguments:

If the corpus is provided with annotations of several entity types (train corpus), you should specify the target semantic type of the conversion. Since the JNLPBA corpus contains 5 semantic types, you have to convert it 5 times, once for each entity type. Afterwards, you can train a CRF model focused on each semantic type.
To convert a corpus focused on a specific entity type:

    ./gimli.sh convert JNLPBA -c corpus.txt -e protein -g gdep.gz -o corpus.gz
    

If no annotations are provided (test corpus), you do not need to specify the target entity type:

    ./gimli.sh convert JNLPBA -c corpus.txt -g gdep.gz -o corpus.gz
    

Developer

The conversion of a corpus is quite simple and follows 3 logical steps:
  • Create a corpus reader that "understands" the corpus format.
  • Read the corpus into memory.
  • Optionally, save the corpus into a file in the Gimli format, which already contains the features extracted with GDep.
The two code snippets below differ mainly in the corpus reader they use.

BC2

// Input
String sentences = "sentences.txt";
String annotations = "annotations.txt";
String gdep = "gdep.gz";
String output = "corpus.gz";

// Create corpus reader
BCReader reader = new BCReader(sentences, annotations, gdep);
Corpus c = null;

// Read corpus using the BIO encoding format 
try {
    c = reader.read(LabelFormat.BIO);
} catch (GimliException ex) {
    logger.error("There was a problem reading the corpus.", ex);
    return;
}

// Write corpus to output file
try {
    c.writeToFile(output);
} catch (GimliException ex) {
    logger.error("There was a problem writing the corpus.", ex);
    return;
}

JNLPBA

// Input
String corpus = "corpus.txt";
EntityType entity = EntityType.protein;
String gdep = "gdep.gz";
String output = "corpus.gz";

// Create corpus reader
JNLPBAReader reader = new JNLPBAReader(corpus, gdep, entity);
Corpus c = null;

// Read corpus using the BIO encoding format 
try {
    c = reader.read(LabelFormat.BIO);
} catch (GimliException ex) {
    logger.error("There was a problem reading the corpus.", ex);
    return;
}

// Write corpus to output file
try {
    c.writeToFile(output);
} catch (GimliException ex) {
    logger.error("There was a problem writing the corpus.", ex);
    return;
}

Model

The Model module is used to train and test the performance of CRF models.
Caution!
  1. The output is provided in gzip format.
  2. The input corpus must be generated by the convert module of Gimli.
The characteristics of a CRF model are defined in a simple configuration file, through which you can specify the features used to train the model and the order of the CRF. Gimli already provides several configuration files, containing the best feature set for each entity type on each corpus.
#Basic
token=1

#Linguistic
stem=0
lemma=1
pos=1
chunk=1
nlp=1

#Orthographic
capitalization=1
counting=1
symbols=1

#Morphological
ngrams=1
suffix=1
prefix=1
morphology=1

#Letters and Numbers
greek=1
roman=0

#Lexicons
prge=1
concepts=0
verbs=1

#Context
window=0
conjunctions=1

#CRF Order
order=2
Two different tasks can be performed using this module: training and testing.

User

To run the Model module, use the command ./gimli.sh model. It receives the following arguments:

The two different tasks are performed as follows:
  • Train: ./gimli.sh model -t train.
  • Test: ./gimli.sh model -t test.
The train task will save the generated model in the provided model file.
./gimli.sh model -t train -c corpusTrain.gz -e protein -m crf.gz -f features.config -p fw
On the other hand, the test task uses the model in the provided file to annotate the corpus and measure its performance.
./gimli.sh model -t test -c corpusTest.gz -e protein -m crf.gz -f features.config -p fw
You should use the verbose mode to get more feedback about the tasks that are being performed.
./gimli.sh model -t train -c corpusTrain.gz -e protein -m crf.gz -f features.config -p fw -v

Developer

Train

Training a CRF model follows 4 simple steps:
  • Load the model characteristics, including features and order.
  • Load the corpus.
  • Train the CRF model.
  • Optionally, save the model into a file.
// Input
String corpus = "corpus.gz";
String features = "bc2.config";
String model = "crf.gz";
Parsing parsing = Parsing.FW;
EntityType entity = EntityType.protein;

// Set defaults
LabelFormat format = LabelFormat.BIO;

// Load model configuration
ModelConfig mc = new ModelConfig(features);

// Load corpus
Corpus c = null;
try {
    c = new Corpus(format, entity, corpus);
} catch (GimliException ex) {
    logger.error("Problem loading the corpus from file.", ex);
    return;
}

// Train CRF Model
CRFModel m = new CRFModel(mc, parsing);
try {
    m.train(c);
} catch (GimliException ex) {
    logger.error("Problem training the model.", ex);
    return;
}

// Save model to file
try {
    m.writeToFile(model);
} catch (GimliException ex) {
    logger.error("Problem saving the model.", ex);
    return;
}

Test

Testing the performance of CRF models follows 4 simple steps:
  • Load model characteristics, including features and order.
  • Load annotated corpus.
  • Load previously trained model.
  • Evaluate the model's performance on the annotated corpus.
// Input
String corpus = "corpus.gz";
String features = "bc.config";
String model = "crf.gz";
Parsing parsing = Parsing.FW;
EntityType entity = EntityType.protein;

// Set defaults
LabelFormat format = LabelFormat.BIO;

// Load model configuration
ModelConfig mc = new ModelConfig(features);

Corpus c = null;
try {
    c = new Corpus(format, entity, corpus);
} catch (GimliException ex) {
    logger.error("Problem loading the corpus from file.", ex);
    return;
}

// Test Model performance on corpus
try {
    CRFModel m = new CRFModel(mc, parsing, model);
    m.test(c);
} catch (GimliException ex) {
    logger.error("Problem loading the model from file.", ex);
    return;
}

Annotate

The goal of this module is to annotate a corpus using one or more models. In the end, the produced annotations are stored in a file, so you can use the evaluation scripts of each specific challenge.
When several models or entity types are used, Gimli takes advantage of a simple confidence-based combination algorithm.
Caution!
  1. The input corpus must be generated by the convert module of Gimli.
  2. The input model must be generated by the model module of Gimli.

User

Caution!
  1. You can provide multiple models with the -m option. Follow the model format of each corpus and separate the models with white spaces.

BC2

To annotate a corpus and get the result in the BioCreative format, run the command ./gimli.sh annotate BC2.

One model

To annotate a corpus using only one model, run a command as follows.
./gimli.sh annotate BC2 -c corpusTest.gz -o output.txt -m crf.gz,fw,bc.config

Various models

If you provide several models, Gimli combines the heterogeneous annotations and provides a single annotation for each chunk of text.
./gimli.sh annotate BC2 -c corpusTest.gz -o output.txt -m crf1.gz,fw,bc1.config \
														  crf2.gz,bw,bc2.config

JNLPBA

To annotate a corpus and get the result in the JNLPBA format, run the command ./gimli.sh annotate JNLPBA.

One model

To annotate a corpus using only one model and save the result in the JNLPBA format, run a command as follows.
./gimli.sh annotate JNLPBA -c corpusTest.gz -o output.txt -m crf_protein.gz,protein,fw,jnlpba_protein.config

Various models

When several models are provided, Gimli combines the annotations. First, it combines the models for the same entity type, generating one corpus per semantic type. Then, it combines the corpora of the different entity types into a single corpus and output file.
./gimli.sh annotate JNLPBA -c corpusTest.gz -o output.txt -m crf_protein1.gz,protein,fw,jnlpba_protein1.config \
															 crf_protein2.gz,protein,bw,jnlpba_protein2.config \
															 crf_dna.gz,DNA,fw,jnlpba_dna.config \
															 crf_RNA.gz,RNA,bw,jnlpba_rna.config

Developer

The annotation process requires 5 simple steps:
  • Load the model characteristics, including features and order.
  • Load the unannotated corpus.
  • Load the previously trained CRF model.
  • Annotate the corpus.
  • If desired, run post-processing algorithms to clean up annotation problems.
The code snippets below differ mainly in whether one or several models are used for the same entity type.

Annotate using one model

// Input
String corpus = "corpus.gz";
String model = "crf.gz";
String features = "bc.config";
Parsing parsing = Parsing.FW;
EntityType entity = EntityType.protein;
String output = "output.txt";

// Load model configuration
ModelConfig mc = new ModelConfig(features);

// Load Corpus
Corpus c = null;
try {
    c = new Corpus(LabelFormat.BIO, entity, corpus);
} catch (GimliException ex) {
    logger.error("There was a problem loading the corpus", ex);
    return;
}

// Load Model
CRFModel crfModel = null;
try {
    crfModel = new CRFModel(mc, parsing, model);
} catch (GimliException ex) {
    logger.error("There was a problem loading the model", ex);
    return;
}

// Annotate corpus
Annotator a = new Annotator(c);
a.annotate(crfModel);

// Post-processing
Parentheses.processRemoving(c);
Abbreviation.process(c);

Annotate using various models

// Input
String corpus = "corpus.gz";
String[] models = {"crf1.gz", "crf2.gz"};
String[] features = {"bc1.config", "bc2.config"};
Parsing[] parsing = {Parsing.FW, Parsing.BW};
EntityType entity = EntityType.protein;
String output = "output.txt";

// Load model configurations
ModelConfig[] mc = new ModelConfig[features.length];
for (int i = 0; i < features.length; i++) {
    mc[i] = new ModelConfig(features[i]);
}

// Load Corpus
Corpus c = null;
try {
    c = new Corpus(LabelFormat.BIO, entity, corpus);
} catch (GimliException ex) {
    logger.error("There was a problem loading the corpus", ex);
    return;
}

// Load Models
CRFModel[] crfmodels = new CRFModel[models.length];
try {
    for (int i = 0; i < models.length; i++) {
        crfmodels[i] = new CRFModel(mc[i], parsing[i], models[i]);
    }
} catch (GimliException ex) {
    logger.error("There was a problem loading the model(s)", ex);
    return;
}


// Annotate corpus
Annotator a = new Annotator(c);
a.annotate(crfmodels);

// Post-processing
Parentheses.processRemoving(c);
Abbreviation.process(c);

With version 1.0.1, on-demand annotation of raw sentences is much simpler:

String features = "config/bc.config";
String model = "resources/models/gimli/bc2gm_bw_o2.gz";
Parsing parsing = Parsing.FW;
EntityType entity = EntityType.protein;
LabelFormat format = LabelFormat.BIO;

// Load model configuration
ModelConfig mc = new ModelConfig(features);

try {
    // Get CRF model
    CRFModel crfModel = new CRFModel(mc, parsing, model);

    // GDepParser
    GDepParser parser = new GDepParser(true);
    parser.launch();

    Corpus corpus = new Corpus(format, entity);

    // Add sentences to corpus
    String sentenceText = "BRCA1 and BRCA2 are human genes that belong"
            + " to a class of genes known as tumor suppressors. "
            + "Mutation of these genes has been linked to hereditary "
            + "breast and ovarian cancer.";
    Sentence sentence = new Sentence(corpus);
    sentence.parse(parser, sentenceText);
    corpus.addSentence(sentence);

    // Annotate corpus
    Annotator annotator = new Annotator(corpus);
    annotator.annotate(crfModel);

    // Post-process removing annotations with odd number of parentheses
    Parentheses.processRemoving(corpus);

    // Post-process by adding abbreviation annotations
    Abbreviation.process(corpus);

    // Access lemmas, POS tags, chunks, dependency parsing results 
    // and annotations.
    for (Sentence s : corpus.getSentences()) {
        System.out.println(s.toExportFormat());
    }

    // Terminate parser
    parser.terminate();
} catch (GimliException ex) {
    logger.error("ERROR:", ex);
} catch (IOException ex) {
    logger.error("ERROR:", ex);
}
After annotating the corpus, you need to save the result into a file. This can be done using either of the two supported formats.

BC2

// Write to file in the BC2 format
BCWriter writer = new BCWriter();
try {
    writer.write(c, output);
} catch (GimliException ex) {
    logger.error("There was a problem writing the corpus to file", ex);
}

JNLPBA

This format supports various entity types in the same file. Thus, Gimli also supports the combination of several corpora with different entity types in one output file.

One corpus > One entity type

// Write to file in the JNLPBA format
JNLPBAWriter writer = new JNLPBAWriter();
try {
    writer.write(c, output);
} catch (GimliException ex) {
    logger.error("There was a problem writing the corpus to file", ex);
}

Various corpora > Various entity types

// Write to file in the JNLPBA format
JNLPBAWriter writer = new JNLPBAWriter();
try {
    writer.write(new Corpus[] {c_protein, c_dna}, output);
} catch (GimliException ex) {
    logger.error("There was a problem writing the corpus to file", ex);
}