To reduce the pre-processing needed for digital analysis of the CBETA corpus, we have defined a simplified set of markup rules. The rules remove information that is irrelevant to textual analysis and simplify the text structure, so that the corpus is easier to understand for people performing digital analysis. We have transformed most of the original CBETA XML files to the new markup standard. The text-analysis-friendly CBETA data set has the following features:
The markup is still compliant with TEI P5.
All unnecessary information in the existing CBETA texts has been removed, e.g. the critical apparatus and the markup for menu items.
The representation of text structure has been simplified and unified: only the element with a “type” attribute is allowed for representing text structure.
Each textual block is wrapped in an element with a type attribute, which distinguishes the different types of text block (prose, verse, dharani…).
Every character that cannot be displayed is assigned a unique code point in the Unicode Private Use Area.
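As a rough illustration of how these features might be used, the following Python sketch groups text blocks by their type attribute and lists any Private Use Area characters they contain. It is only a sketch under stated assumptions: the sample document, the element names `div` and `ab`, the type value `jing`, and the code point U+E001 are all hypothetical and are not taken from the data set itself; the actual element names should be checked against the published files.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample. The element names (div, ab), the type values,
# and the code point U+E001 are assumptions made for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="jing">
      <ab type="prose">如是我聞</ab>
      <ab type="verse">諸行無常</ab>
      <ab type="dharani">唵\ue001吽</ab>
    </div>
  </body></text>
</TEI>"""


def is_private_use(ch):
    """Return True if ch lies in one of the Unicode Private Use Areas."""
    cp = ord(ch)
    return (
        0xE000 <= cp <= 0xF8FF          # BMP Private Use Area
        or 0xF0000 <= cp <= 0xFFFFD     # Supplementary Private Use Area-A
        or 0x100000 <= cp <= 0x10FFFD   # Supplementary Private Use Area-B
    )


def blocks_by_type(xml_text, block_tag="ab"):
    """Group the text of every block_tag element by its type attribute.

    block_tag is an assumed name for the element that wraps a text block.
    """
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for el in root.iter():
        local_name = el.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
        if local_name == block_tag and "type" in el.attrib:
            grouped[el.get("type")].append("".join(el.itertext()).strip())
    return dict(grouped)


def private_use_chars(text):
    """Return the sorted, de-duplicated Private Use Area characters in text."""
    return sorted({ch for ch in text if is_private_use(ch)})


blocks = blocks_by_type(SAMPLE)
for block_type, texts in blocks.items():
    for text in texts:
        codes = [f"U+{ord(ch):04X}" for ch in private_use_chars(text)]
        print(block_type, len(text), codes)
```

Because every block carries a type attribute, an analysis can select, for example, only the verse blocks without inspecting the surrounding structure. Private Use Area characters have no standard meaning, so they would need to be resolved against whatever mapping the data set provides before being displayed or compared.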