Overview

Main code files:

  • controller.py

Contains the download_and_parse() entry point that triggers all the subsequent processes. Here, the CLI flags are first parsed and processed. If the tool is launched in download mode, then the download_pubmed function is first called to obtain the XML file. Otherwise, it continues with the provided XML file, instantiates the model indicated by the --model flag, and it is passed to the Parser object after which it calls the process_papers() method. Thus, the file controls the workflow and brings together the separate parts of the tool.

  • parser.py

Defines the Parser class which reads the XML file in chunks, spawns processes for parsing articles in parallel, and handles writing the results to disk.

  • models.py

Contains the models implemented for parsing abstracts. The `RelationsExtractor abstract class designates a template specifying the methods that should be implemented by models that inherit from it. Here RulesExtractor contains all functionality for parsing a piece of text (abstract, in this case) and return a dictionary that can be written to disk later.

  • utils.py

Contains several helper functions such as processing synonyms if required, fetching stopwords, and generate filenames.