Conversion from EAF to TEITOK

The searchable TEITOK corpus of DoReCo was created automatically from the EAF files that are available for download via the Languages tab, combined with the metadata file for each language, which provides the metadata for each transcription file. There is a separate TEITOK corpus for each language. Each corpus contains both the core files and the extended files, and the metadata records which part of the corpus each file belongs to.

The TEITOK files are stored in the TEI/XML format, with the utterances ordered chronologically. This yields an interview-style transcription, as opposed to the tier-based order of the EAF files. The corpus contains all the levels of the original tier files, but rather than working with dependent tiers, the TEI file has distinct nodes for each element: each text consists of utterances, with word units below each utterance and morph units below the word units. The phone units are ignored in the TEITOK corpus, since there is currently no way to use them.

All information was taken directly from the respective tiers, except for the written form of each word, which was added to the corpus version for convenience. This form had to be generated automatically from the original full-sentence transcription text, since all other levels in the DoReCo EAF files provide only IPA transcriptions.

The TEITOK corpus uses only the core tiers, since the supplementary tiers have not been verified in the DoReCo project. Not all languages provide information for all tiers, and sometimes less information is available for documents from the extended corpus. When available, the part-of-speech text is shown and made searchable, but the POS tags have not been normalized in the DoReCo project.
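The reordering described above, from tier-based to chronological and from dependent tiers to nested nodes, can be illustrated with a minimal Python sketch. This is not the actual DoReCo conversion script: the input dictionary, the sample forms, the function name `tiers_to_tei`, and the element and attribute names (`u`, `tok`, `m`, `who`, `start`, `end`, `form`) are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical tier-based input: one list of time-aligned utterances per
# speaker tier, each with its words and their morphs (invented sample data).
tiers = {
    "A": [
        {"start": 0.0, "end": 1.2, "words": [{"form": "kata", "morphs": ["ka", "ta"]}]},
        {"start": 3.0, "end": 4.1, "words": [{"form": "mi", "morphs": ["mi"]}]},
    ],
    "B": [
        {"start": 1.5, "end": 2.8, "words": [{"form": "salu", "morphs": ["sa", "lu"]}]},
    ],
}


def tiers_to_tei(tiers):
    """Flatten tier-ordered utterances into one chronologically ordered tree."""
    # Collect the utterances of all tiers, remembering which speaker said what.
    utterances = [
        dict(utt, who=speaker)
        for speaker, utts in tiers.items()
        for utt in utts
    ]
    # Chronological order replaces the tier-by-tier order of the source.
    utterances.sort(key=lambda utt: utt["start"])

    text = ET.Element("text")
    for utt in utterances:
        # Assumed element names: <u> for utterances, <tok> for words, <m> for morphs.
        u = ET.SubElement(
            text, "u", who=utt["who"], start=str(utt["start"]), end=str(utt["end"])
        )
        for word in utt["words"]:
            tok = ET.SubElement(u, "tok", form=word["form"])
            for morph in word["morphs"]:
                m = ET.SubElement(tok, "m")
                m.text = morph
    return text


tei = tiers_to_tei(tiers)
speaker_order = [u.get("who") for u in tei.iter("u")]
print(speaker_order)  # ['A', 'B', 'A']
print(ET.tostring(tei, encoding="unicode"))
```

In this sketch the speakers alternate in the output even though each speaker's utterances were stored together in the input, which is the interview-style effect of sorting by start time.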