Importing corpora

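Corpora can be imported from several input formats. The formats covered by the inspection functions shown below are Praat TextGrids, TextGrid output from forced aligners (MFA and FAVE), LaBB-CAT output, Partitur files, and the corpus-specific TIMIT and Buckeye formats.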

Each format has an inspection function in the polyglotdb.io submodule that checks that the format of the specified directory or file matches the input format and returns the appropriate parser.

These functions are used as follows:

import polyglotdb.io as pgio

corpus_directory = '/path/to/directory'

parser = pgio.inspect_mfa(corpus_directory) # MFA output TextGrids

# OR

parser = pgio.inspect_fave(corpus_directory) # FAVE output TextGrids

# OR

parser = pgio.inspect_textgrid(corpus_directory)

# OR

parser = pgio.inspect_labbcat(corpus_directory)

# OR

parser = pgio.inspect_partitur(corpus_directory)

# OR

parser = pgio.inspect_timit(corpus_directory)

# OR

parser = pgio.inspect_buckeye(corpus_directory)

Note

For more technical detail on the inspect functions and the parser objects they return, see PolyglotDB I/O.
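
When the format is only known at run time (for example, from a command-line argument), the choice between the inspect functions above can be made with a lookup table. The following is a minimal sketch, not part of PolyglotDB: the helper get_parser and the format labels in INSPECTORS are hypothetical names introduced here, and only the inspect function names come from the examples above.

```python
# Hypothetical format labels mapped to the inspect function names shown above.
INSPECTORS = {
    'mfa': 'inspect_mfa',
    'fave': 'inspect_fave',
    'textgrid': 'inspect_textgrid',
    'labbcat': 'inspect_labbcat',
    'partitur': 'inspect_partitur',
    'timit': 'inspect_timit',
    'buckeye': 'inspect_buckeye',
}


def get_parser(io_module, corpus_format, path):
    """Call the inspect function for corpus_format on path.

    io_module is the module that provides the inspect functions; it is
    passed in (rather than imported here) so the helper stays independent
    of how the module is imported.
    """
    try:
        function_name = INSPECTORS[corpus_format.lower()]
    except KeyError:
        known = ', '.join(sorted(INSPECTORS))
        raise ValueError(
            'Unknown format {!r}; expected one of: {}'.format(corpus_format, known)
        )
    # Look the function up by name and let it inspect the path.
    return getattr(io_module, function_name)(path)

# Intended usage (requires PolyglotDB to be installed):
#   import polyglotdb.io as pgio
#   parser = get_parser(pgio, 'mfa', '/path/to/directory')
```

Keeping the mapping in one dictionary means an unsupported format fails early with a readable message instead of an AttributeError.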

To import a corpus, first import the CorpusContext context manager from polyglotdb:

from polyglotdb import CorpusContext

CorpusContext is the primary interface for interacting with corpora.

Before importing a corpus, make sure that a Neo4j server is running. Interacting with a corpus requires the connection details for that server. The easiest way to supply them is the utility function ensure_local_database_running (see Interacting with a local Polyglot database for more information):

from polyglotdb.utils import ensure_local_database_running
from polyglotdb import CorpusConfig

with ensure_local_database_running('database_name') as connection_params:
    config = CorpusConfig('corpus_name', **connection_params)

The above config object contains all the configuration for the corpus.

To import a file into a corpus (in this case a TextGrid):

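import polyglotdb.io as pgio
from polyglotdb import CorpusConfig, CorpusContext
from polyglotdb.utils import ensure_local_database_running

# Inspect the file to get a parser for its format
parser = pgio.inspect_textgrid('/path/to/textgrid.TextGrid')

with ensure_local_database_running('database_name') as connection_params:
    config = CorpusConfig('my_corpus', **connection_params)
    with CorpusContext(config) as c:
        # Load the file into the corpus using the parser
        c.load(parser, '/path/to/textgrid.TextGrid')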

In the above code, the io module is imported and provides access to all the importing and exporting functions. For every format, there is an inspect function that generates a parser for that file and for other files formatted the same way. In the case of a TextGrid, the parser has annotation types that correspond to interval and point tiers. The inspect function tries to guess the relevant attributes of each tier.
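
Because a parser is built from one file and then reused for others in the same format, it can help to check which TextGrid files a path actually contains before importing. The sketch below uses only the Python standard library; find_textgrids is a hypothetical helper introduced here, not a PolyglotDB function, and it assumes files are identified by a .TextGrid extension (matched case-insensitively).

```python
import os


def find_textgrids(path):
    """Return a sorted list of .TextGrid files at or under path.

    path may be a single file or a directory; a path that does not
    exist yields an empty list.
    """
    if os.path.isfile(path):
        return [path] if path.lower().endswith('.textgrid') else []
    found = []
    # os.walk yields nothing for a missing directory, so no extra check is needed.
    for root, _dirs, files in os.walk(path):
        for name in files:
            if name.lower().endswith('.textgrid'):
                found.append(os.path.join(root, name))
    return sorted(found)
```

An empty result is a cheap early signal that the path is wrong, before any database connection is opened.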

Note

The discourse load function of CorpusContext objects takes a parser as its first argument. Parsers have an attribute, annotation_types, which the user can modify to change how a corpus is imported. For most standard formats, including TextGrids from aligners, no modification is necessary.
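
Before modifying annotation_types, it is useful to see what the inspect function guessed. The following is a minimal sketch under two assumptions: that annotation_types is an iterable, as the note above implies, and that each entry exposes a name attribute. The helper annotation_type_names is hypothetical, and it falls back to repr() for entries without a name, so it does not depend on the second assumption holding.

```python
def annotation_type_names(parser):
    """Return a label for each annotation type on parser, in order."""
    names = []
    for annotation_type in parser.annotation_types:
        # Assumed attribute: use the name if present, otherwise the repr.
        names.append(getattr(annotation_type, 'name', repr(annotation_type)))
    return names

# Intended usage (requires PolyglotDB to be installed):
#   parser = pgio.inspect_textgrid('/path/to/textgrid.TextGrid')
#   print(annotation_type_names(parser))
```

If the attribute has a different name in the installed version of PolyglotDB, the API documentation for the parser classes is the place to check.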

All interaction with the database goes through the CorpusContext context manager. Further details on import arguments can be found in the API documentation.

Once the above code has run, the corpus can be queried and explored.