A Hebrew textual corpus on construction, planning, and architecture

2022-2024

Funding agency: Israel Innovation Authority

Research leader: Or Aleksandrowicz

Project supervisors: Daniel Rosenberg, Omri Shafer-Raviv

Project advisors: Noam Ordan, Nick Howell

Project assistants: Dina El Qasem, Hodaya Saada, Mai Sabbah, Sherry-Atara Khasdan, Naama Koren, Shiran-Ester Shnaiderman

The construction industry is one of Israel’s main economic sectors, and it is expected to remain central in the coming decades given the country’s rapid population growth. Unlike many developed countries, where low population growth keeps the rate of new construction slow, Israel’s built-up area doubles every 25 years. Creating a Hebrew textual corpus on construction, planning, and architecture is expected to facilitate and expedite the development of NLP-based tools and their adoption in technological fields related to the construction industry.

The corpus consists of Hebrew documents from a wide variety of contemporary and historical sources, including legislative decrees, regulatory guidelines, research reports, academic studies, and professional journals. In developing the corpus, we used both born-digital and scanned printed publications, which went through optical character recognition (OCR), cleaning, and parsing. Parsing was performed using the Trankit Python toolkit.
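As a rough illustration of the parsing step, the sketch below runs Trankit's pretrained Hebrew pipeline on raw text and walks the resulting words. It is a minimal sketch under stated assumptions, not the project's actual processing code: the helper names `parse_hebrew` and `iter_words` are hypothetical, the sample sentence is invented, and the project's real Trankit configuration is not described here.

```python
from typing import Iterator


def iter_words(parsed: dict) -> Iterator[dict]:
    """Yield syntactic words from a Trankit-style document dict.

    Hebrew surface tokens often fuse several words (for example a
    preposition prefix plus a noun). Trankit reports such cases as
    multi-word tokens with an 'expanded' list, which is unpacked here.
    """
    for sentence in parsed.get("sentences", []):
        for token in sentence.get("tokens", []):
            # Multi-word tokens carry their parts under 'expanded';
            # ordinary tokens are yielded as they are.
            yield from token.get("expanded", [token])


def parse_hebrew(text: str) -> dict:
    """Parse raw Hebrew text with Trankit's pretrained pipeline.

    Hypothetical helper: the import is done lazily because creating
    the pipeline downloads the pretrained model on first use.
    """
    from trankit import Pipeline

    pipeline = Pipeline("hebrew")
    return pipeline(text)


# Example usage (requires `pip install trankit` and a model download):
#
#   doc = parse_hebrew("תוכנית המתאר אושרה בוועדה המקומית.")
#   for word in iter_words(doc):
#       print(word["text"], word.get("upos"), word.get("deprel"))
```

Unpacking the `expanded` entries matters for Hebrew in particular, because part-of-speech tags and dependency relations are attached to the individual words rather than to the fused surface token.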

The corpus contains 22,382,594 words in 1,218 documents.

The corpus is available for all uses in NLP research and development under the CC BY 4.0 license (Attribution 4.0 International).

We wish to thank Vicky Davydov, Lena Avrahami, and Shai Zack from the Library of the Faculty of Architecture and Town Planning (Technion), as well as Moti Yeger, Director of the Technion’s Central Library, and Prof. Rafael Sacks, Head of the National Building Research Institute, for their help with the project since its inception.

Cite: Aleksandrowicz, O., Rosenberg, D., Shafer-Raviv, O., Ordan, N. (2024). Hebrew textual corpus on construction, planning, and architecture. GitHub. https://github.com/bdar-lab/heb_architecture_corpus.