Overview
The typical workflow for the tool is as follows:
- Provide search terms for the tool to retrieve relavant articles and store the data in an XML file;
- The XML file will then be read by chunks to process the information in a structured format;
- Each article is processed in parallel and the information is stroed in a JSON file on disk.
Output
Example JSON output with dummy values:
[
{
"Article Title": "Clinical trial of X",
"Journal Title": "Cancer",
"PMID": "999999",
"Year": "2019",
"Relations": [
{
"Genes": [
"brca2",
"tp53"
],
"Stems": [
"associat"
],
"Sentence": "We demonsrate an associatiation between previously published SNPs affecting BRCA2 and TP43"
}
]
},
...
]
Usage
The tool expects either an XML file in the format used by Entrez for the PubMed database, or specific terms to search for and download PubMed articles itself. Therefore, to distinguish between the 2 modes, the tool should be called with one of these 2 keywords: local
or download
.
Download new files
Similarly, to download new articles, use the following format:
$ pangaea download "tp53"
By default, the tool downloads only the first 10 results. To change the limit pass the --number
argument:
$ pangaea download "mdm2 or tp53" --number 20000
For advanced search syntax, the tool uses PubMed functionality; more details here.
Existing XML file
The XML file may be generated using download-entrez. To pass an XML file to the tool, just type in the name of the file after the mode.
$ pangaea local tp53.xml
By default, the tool stores the output in a file named output.json
in the current directory. Pass the --output
flag to specify an output file:
$ pangaea local --output=output_tp53.xml tp53.xml
General help
For more information, please use:
$ pangaea --help
$ pangaea download --help
$ pangaea local --help
Performance
Workflow
At the moment, the gene extraction uses the following workflow:
- Use a sentence tokenizer to split the text into sentences
- At the moment, it's using Punkt Sentence Tokenizer as implemented in NLTK
- Replace "/" with " / "
- This is perfomred in order to treat words around the "/" as separate words
- Checks if within the current sentence there's a relation word
- This is done using the dictionary from the stems.csv file
- Eliminate all punctuation
- It is not clear whether it'd be better to keep dashes ("-")
- Convery sentence to lowercase
- Use a word tokenizer to split the sentence into words
- At the moment, it uses Treebank tokenizer as implemented in NLTK
- Remove stopwords
- Remove only stopwords that are not also gene names using this compiled list.
- Use Part-of-Speech tagger to tag each word with a PoS
- At the moment, it uses the Perceptron Tagger as implemented by NLTK
- Split the words in a sentence into n-grams
- At the moment,
N = 4
- At the moment,
- Skip any n-gram in which none of the words are part of the target tags
- The target tags at the moment are
['NNP', 'NN', 'JJ']
- The target tags at the moment are
- Loop through all possible combinations of the words in the current n-gram (a non-null powerset)
- Create a new word from each combination by concatenating the words in the combination
- Loop through all short gene names and detect if they are identical to the new word
- Short gene names at the moment are gene names under 6 characters
- Loop through all long gene names and detect if they are substrings of the new word
- If at least one gene name is found, store the details of the article where it has been found (article title, journal title, PMID, year, relations found)