04

Corpora

The project develops a multilingual, diachronic corpus architecture spanning multiple cultural traditions. The composition listed below is preliminary; the final scope will be determined in collaboration with domain specialists and depends on which sources are available in digitized form.

The guiding principle is to construct a quantitatively representative subset for each tradition, large and internally diverse enough to support cross-cultural comparison without privileging any single tradition as a normative center.

  1. Arabic and Islamic

    Taʿbīr manuals, Hadith collections, biographical dictionaries, and chronicles spanning Sunni, Shia, and related denominational traditions (7th century onward).

    Primary infrastructure: OpenITI / KITAB corpus; Al-Maktaba Al-Shamela

  2. Australian Aboriginal

    Dreaming narratives and dream accounts from the anthropological record (Spencer and Gillen through Stanner and Tonkinson); access and use subject to AIATSIS protocols and community consent across more than 250 distinct language groups.

  3. Chinese

    Dynastic histories, biji literature, oneiromantic manuals, and Buddhist and Daoist dream accounts (Tang through Qing).

    Primary infrastructure: Chinese Text Project; International Dunhuang Programme

  4. Indic

    Sanskrit, Pali, Tamil, and Tibetan traditions, spanning medical, epic, canonical, and contemplative literature.

    Primary infrastructure: GRETIL; Digital Corpus of Sanskrit (DCS); Buddhist Digital Resource Center (BDRC)

  5. Japanese

    Heian-period dream diaries, Buddhist omen-dream records, and oneiric poetry (Nara through Edo).

    Primary infrastructure: National Diet Library Digital Collections (NDL)

  6. Māori

    19th-century manuscript diaries in te reo Māori, the Journal of the Polynesian Society archive, and Elsdon Best’s ethnographic record.

    Primary infrastructure: Journal of the Polynesian Society digital archive

  7. Mesoamerican

    Nahuatl colonial documents, pictographic codices, and Maya and Aztec ethnographic and ritual records.

    Primary infrastructure: Foundation for the Advancement of Mesoamerican Studies (FAMSI)

  8. Middle Eastern and Ancient Near Eastern

    Mesopotamian dream incubation texts, Assyrian dream omen series (Iškar Zaqīqu), Egyptian oneiric papyri, and Hebrew Biblical and post-Biblical dream traditions.

    Primary infrastructure: Cuneiform Digital Library Initiative (CDLI); Thesaurus Linguae Aegyptiae (TLA)

  9. Slavic and Russian

    Pre-Christian oral tradition, Orthodox hagiographic literature, ethnographic folklore collections, and literary dream records (19th and early 20th centuries).

    Primary infrastructure: Prozhito (European University at St. Petersburg); Fundamental Digital Library of Russian Literature and Folklore (FEB-web); further sources pending collaboration agreements.

  10. South American

    Amazonian dream-soul traditions (Kagwahiv, Yanesha, Mehinaku, Shipibo, Ese Eja, and others); Andean Quechua and Aymara colonial and contemporary sources.

    Primary infrastructure: Archive of the Indigenous Languages of Latin America (AILLA)

  11. West African

    Yoruba, Akan, and Hausa traditions via colonial-era and post-colonial ethnographic corpora.

  12. Western

    Contemporary dream reports and specialised bibliographic sources. Exclusive access to the Dream Library Foundation collection (~15,000 bibliographic units spanning non-Western and Western traditions, rare and out-of-print scholarship, and unpublished dream journals) is combined with the Sleep and Dream Database (SDDb, ~80,000 contemporary dream reports) and DreamBank.

    Primary infrastructure: Dream Library Foundation

Underrepresented / future corpus development

Underrepresented traditions identified for future corpus development include Southeast Asian Buddhist literatures (Thai, Burmese, Cambodian), Persian and Zoroastrian oneiric texts, Indigenous North American traditions beyond the Mesoamerican record, and sub-Saharan East African oral traditions.

The multilingual and diachronic design is foundational rather than incidental. It requires language-specific NLP pipelines developed with domain specialists, transparent documentation of imbalances where digitization lags, and explicit acknowledgment that no single cultural tradition functions as the analytical center of comparison.
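
One way to make the documentation of imbalances concrete is a small corpus manifest that records, per tradition, the primary infrastructure and the size of the digitized subset. The following is a minimal sketch under stated assumptions, not the project's implementation: the `CorpusEntry` class, its field names, and the two helper functions are hypothetical, and token counts are left unset because no corpus sizes have been measured yet. Only the tradition names and infrastructure labels are taken from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CorpusEntry:
    """One tradition in the manifest (class and field names are hypothetical)."""
    tradition: str
    infrastructure: List[str] = field(default_factory=list)
    token_count: Optional[int] = None  # None = not yet measured or digitized


def imbalance_report(entries: List[CorpusEntry]) -> Dict[str, Optional[float]]:
    """Return each tradition's share of all measured tokens.

    Traditions without a token count map to None, so gaps stay visible
    in the report instead of being silently dropped.
    """
    total = sum(e.token_count for e in entries if e.token_count is not None)
    report: Dict[str, Optional[float]] = {}
    for e in entries:
        if e.token_count is None or total == 0:
            report[e.tradition] = None
        else:
            report[e.tradition] = e.token_count / total
    return report


def missing_infrastructure(entries: List[CorpusEntry]) -> List[str]:
    """List traditions for which no primary infrastructure is recorded."""
    return [e.tradition for e in entries if not e.infrastructure]


# Excerpt of the manifest; infrastructure labels are copied from the list
# above, token counts are deliberately unset (no measurements exist yet).
manifest = [
    CorpusEntry("Australian Aboriginal"),
    CorpusEntry("Japanese", ["National Diet Library Digital Collections (NDL)"]),
    CorpusEntry("West African"),
]

print(missing_infrastructure(manifest))
print(imbalance_report(manifest))
```

Keeping unmeasured traditions in the report as explicit `None` values, rather than omitting them, is a design choice intended to match the commitment to transparent documentation: a tradition where digitization lags appears as a visible gap and not as an absence.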