DILA 2025

Download CBETA Full Texts in Various Formats

1. Download full texts in various formats via the CBETA API

2. CBETA Basic Markup (BM)

The CBETA simple markup version.

3. CBETA XML P5a

The XML version used internally by CBETA editors: https://github.com/cbeta-git/xml-p5a

4. CBETA XML P5

The version for external use, generated programmatically from XML P5a: https://github.com/cbeta-org/xml-p5

5. Text-analysis friendly CBETA data set

The data set is available at https://github.com/DILA-edu/CBETA_TAFxml

To reduce the pre-processing needed for digital analysis of the CBETA corpus, we have defined a simplified set of markup rules. These rules remove information irrelevant to textual analysis and simplify the text structure, so that the corpus is easier to understand for people performing digital analysis. We have transformed most of the original CBETA XML files to the new markup standard. The text-analysis friendly CBETA data set has the following features:

The markup is still compliant with TEI P5.
All unnecessary information has been removed from the existing CBETA texts, e.g. the critical apparatus and the markup for menu items.
The representation of text structure has been simplified and unified: only the element with a "type" attribute is allowed for representing text structure.
Each textual block is wrapped in an element with a type attribute, which distinguishes the different types of text block (prose, verse, dharani, ...).
Every undisplayable character is assigned a unique code point in the Unicode Private Use Area.
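
The features above suggest a simple way to read these files: group the text blocks by their type attribute and flag any Private Use Area characters. The following is a minimal sketch under stated assumptions, not the data set's official tooling. The element names (div, p), the type value "jing", and the sample text are hypothetical; only the "type" attribute and the use of the Private Use Area come from the description above. The sketch treats an element as a text block when it carries a type attribute and contains no nested element that also carries one.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample; real files may use other element names and type values.
SAMPLE = """<body>
  <div type="jing">
    <p type="prose">如是我聞</p>
    <p type="verse">諸行無常</p>
    <p type="prose">一時佛在\uE001舍衛國</p>
  </div>
</body>"""


def is_private_use(ch):
    """Return True if ch lies in one of the Unicode private use ranges."""
    cp = ord(ch)
    return (
        0xE000 <= cp <= 0xF8FF
        or 0xF0000 <= cp <= 0xFFFFD
        or 0x100000 <= cp <= 0x10FFFD
    )


def private_use_chars(text):
    """Return the sorted list of distinct private use characters in text."""
    return sorted({ch for ch in text if is_private_use(ch)})


def blocks_by_type(xml_text):
    """Group the text of innermost typed elements by their type attribute."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for elem in root.iter():
        block_type = elem.get("type")
        if block_type is None:
            continue
        # Skip structural wrappers: elements that contain another typed element.
        has_typed_descendant = any(
            d is not elem and d.get("type") is not None for d in elem.iter()
        )
        if has_typed_descendant:
            continue
        grouped[block_type].append("".join(elem.itertext()).strip())
    return dict(grouped)


if __name__ == "__main__":
    for block_type, texts in blocks_by_type(SAMPLE).items():
        print(block_type, len(texts), "block(s)")
```

Because the attribute lookup ignores element names, the same function should work whether or not the real files declare the TEI namespace. Private use characters found this way still need a separate mapping to be rendered or normalized.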

6. CBETA data in plain text format

The data set is available at https://github.com/DILA-edu/CBETA-txt

The data set is generated by transforming the XML files into plain text format.

Each directory represents a text in CBETA.
Inside the directory, the file whose name ends with _000.txt contains the whole text.
A file whose name ends with _NNN.txt contains the NNN-th fascicle of the text.
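
The naming convention above can be turned into a small index of one text directory. This is a sketch under the assumption that NNN is always a zero-padded three-digit number, as the _000.txt suffix suggests. The function name index_text_directory and the file names in the usage note are hypothetical.

```python
import re
from pathlib import Path

# Assumption: the suffix is exactly three digits followed by ".txt".
SUFFIX_RE = re.compile(r"_(\d{3})\.txt$")


def index_text_directory(text_dir):
    """Return (whole_text_path, {fascicle_number: path}) for one text directory.

    whole_text_path is None if no file ending in _000.txt is found.
    """
    whole = None
    fascicles = {}
    for path in sorted(Path(text_dir).iterdir()):
        match = SUFFIX_RE.search(path.name)
        if match is None:
            continue  # not part of the _NNN.txt convention
        number = int(match.group(1))
        if number == 0:
            whole = path  # _000.txt holds the whole text
        else:
            fascicles[number] = path  # _NNN.txt holds fascicle NNN
    return whole, fascicles
```

For a directory holding, say, T0001_000.txt, T0001_001.txt and T0001_002.txt, the function would return the path of the first file as the whole text and a dictionary mapping 1 and 2 to the other two. Reading the files with encoding="utf-8" is a reasonable guess, but the encoding is not stated here.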