Digital Archive of Queensland Architecture

Introduction


This digital archive of Queensland architecture offers a wealth of information on the region’s design history. The archive was originally constructed as part of the Hot Modernism project and focused on the period 1945-1975. We are now expanding to include earlier and later periods. The database comprises thousands of audiovisual and textual files that allow you to explore relationships between people, projects and firms.

The following sections describe the data processing procedure that integrates DAQA data into the ACDE, followed by a summary of the DAQA data.

Data Processing


Data Extraction & Exploration

The DAQA data needs to be scraped from its original website, which poses a challenge: the data model must be easy to integrate into the ACDEA while staying as close as possible to the original DAAO data schema. The data model is therefore designed carefully to capture all relevant information from the scraped data and to integrate it into the ACDEA without losing its original structure and meaning.

Exploring the Browsing tab on the original DAQA website shows that there are at least five key entities in DAQA: ARCHITECT, FIRM, PROJECT, ARTICLE, and INTERVIEW.

Parsing the DAQA website revealed several implicit APIs that are useful for web scraping.

  1. https://qldarch.net/ws/search?q={query_terms}&pc={page_count}&p={page_no}: When query_terms is *, the general records for all document ids in DAQA can be retrieved.

  2. https://qldarch.net/ws/{query_category}/{doc_id}: Retrieves the detailed record for a specific document id within a given query category. The value of query_category can be media or archobj.

  3. https://qldarch.net/ws/media/download/{doc_id}: Retrieves the resource for a specific document id within the media category.

  4. https://qldarch.net/ws/{query_type}: Retrieves all detailed records under a specific query type, where the query_type values are the names listed in the Browsing tab on the original website.
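To make the usage concrete, here is a minimal Python sketch of how the first two APIs might be called with the requests library. The page size, the assumption that the endpoints return JSON, and the stopping condition are inferred from the URL patterns above rather than from documented behaviour.

```python
import requests

BASE = "https://qldarch.net/ws"
PAGE_SIZE = 100  # assumed page size; pc appears to control records per page


def fetch_all_general_records():
    """Page through the search API with q=* to collect every general record."""
    records, page = [], 0
    while True:
        resp = requests.get(
            f"{BASE}/search",
            params={"q": "*", "pc": PAGE_SIZE, "p": page},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()  # assumed to be a JSON list of general records
        if not batch:        # an empty page signals the end of the results
            break
        records.extend(batch)
        page += 1
    return records


def fetch_detail(query_category, doc_id):
    """Fetch the detailed record for one document id ('media' or 'archobj')."""
    resp = requests.get(f"{BASE}/{query_category}/{doc_id}", timeout=30)
    resp.raise_for_status()
    return resp.json()
```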

By analysing all the general records retrieved via https://qldarch.net/ws/search?q=*, the DAQA records can be classified into two categories: media and archobj. The media category contains 10 record types: Photograph, LineDrawing, Image, Article, Audio, Transcript, Portrait, Youtube, Video, and Spreadsheet. The archobj category contains 13 record types: structure, person, firm, article, interview, publication, topic, education, award, event, place, organisation, and government.

The following charts, which were generated by the jupyter notebook DAQA_HierachySummary.ipynb, illustrate the hierarchy of DAQA records.
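As an indicative sketch only: assuming each scraped general record exposes category and type fields (the actual field names in the DAQA data may differ), the per-category tallies behind such charts could be computed like this.

```python
from collections import Counter


def summarise_hierarchy(records):
    """Tally record types within each top-level category.

    Assumes each general record is a dict with 'category' and 'type' keys;
    the real field names in the scraped data may differ.
    """
    counts = Counter((r.get("category"), r.get("type")) for r in records)
    for (category, rtype), n in sorted(counts.items()):
        print(f"{category:>8} / {rtype:<12} {n}")
```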


To facilitate loading DAQA data into the ACDE, a scraping pipeline has been established. The process involves the following steps:

DAQA Web Scraping Workflow

  1. Database Setup: Setting up a MongoDB database called daqa_scraped.

  2. All Records Scraping: Scraping all general records by calling the API https://qldarch.net/ws/search?q=*, and storing them in the all_objects collection.

  3. Key Objects Scraping: Scraping all detailed records, including their relationship records, for the five key objects explicitly listed on the original website. These records are stored in their respective collections in the database. In particular, the geographical coordinates of project records are reverse-geocoded to obtain standardised geographic information for each project (a sketch of this step follows the list).

  4. Records Supplement: Supplementing records of all other objects not listed on the original website and extracting implicit records from the scraped data.

    1. Inserting all other archival objects that are not listed on the original website, but exist in the all_objects collection, into their respective collections in the database.

    2. Inserting all the media records into the media collection.

    3. Converting some fields of the records into corresponding relationship records and inserting them into the relationship collection.

    4. Adding unpublished records that couldn’t be found in the all_objects collection but exist in relationship records.

    5. Adding a tracking attribute called ori_dbid to the subject and object of relationship records in the relationship collection.

  5. Date Format Cleansing: Cleansing date formats and converting all date attributes into a standard JSON format consisting of fields for year, month, and day (a sketch of this step follows the list).

  6. New Fields Update: Updating records with new fields supplied by experts at the University of Queensland.
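For step 3, the reverse-geocoding idea can be sketched as follows using geopy's Nominatim geocoder. Both the geocoding backend and the latitude/longitude field names are assumptions for illustration, not necessarily what the pipeline uses.

```python
from geopy.geocoders import Nominatim

# Nominatim is one possible reverse-geocoding backend; the pipeline's actual
# service may differ. A descriptive user_agent is required by Nominatim.
geolocator = Nominatim(user_agent="daqa_scraper_example")


def reverse_geocode_project(project):
    """Attach standardised address details to a project record.

    Assumes the scraped project carries 'latitude'/'longitude' fields;
    the real field names in DAQA records may differ.
    """
    lat, lon = project.get("latitude"), project.get("longitude")
    if lat is None or lon is None:
        return project
    location = geolocator.reverse((lat, lon), exactly_one=True)
    if location is not None:
        project["standardised_address"] = location.address
        project["address_components"] = location.raw.get("address", {})
    return project
```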
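For step 5, here is a minimal sketch of date-format cleansing into the {year, month, day} structure. The accepted input formats are illustrative assumptions; the real cleansing in DAQA_Scraping.ipynb may handle other variants.

```python
from datetime import datetime

# Illustrative input formats; the real DAQA data may use other variants.
_DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %Y", "%Y"]


def cleanse_date(raw):
    """Convert a raw date string into {'year': ..., 'month': ..., 'day': ...}.

    Components absent from the matched format are set to None, so a partial
    date such as '1957' survives as {'year': 1957, 'month': None, 'day': None}.
    """
    if not raw:
        return {"year": None, "month": None, "day": None}
    for fmt in _DATE_FORMATS:
        try:
            dt = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
        return {
            "year": dt.year,
            "month": dt.month if ("%m" in fmt or "%B" in fmt) else None,
            "day": dt.day if "%d" in fmt else None,
        }
    return {"year": None, "month": None, "day": None}  # unparseable input
```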

Finally, the conceptual data model of the scraped DAQA data is shown as follows:

DAQA Conceptual Data Model

The schema of DAQA is shown as follows:

DAQA Schema

This pipeline ensures that all relevant records are scraped and loaded into the database in an organised and structured manner, ready for integration into the ACDE. For more details on the web scraping of DAQA, please refer to the Jupyter notebook DAQA_Scraping.ipynb.

Data Transformation & Loading

Because much of the preparation is done during scraping, transforming and loading the DAQA data is comparatively straightforward: corresponding entities and attributes are mapped onto the entities and attributes of the ACDEA. The related records of each original record are aggregated from the relationship entity and written into the related attribute of the original record.

At the entity level, the DAQA entity projection is listed as follows:

| DAQA Entity (Collection) | ACDEA Entity |
|---|---|
| person | person |
| structure | work |
| event | event |
| award | recognition |
| place, structure | place |
| government, organisation, firm | organisation |
| publication, article, interview, media | resource |
| relationship | relationship |
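As a sketch, the projection in the table can be captured as a lookup from DAQA collection names to ACDEA entities. The dictionary and helper below are hypothetical simplifications of the mapping implemented in DAQA_Loading.ipynb, which also performs attribute-level renaming and aggregates related records.

```python
# Projection from DAQA collections to ACDEA entities, as per the table above.
# Collections that map to more than one ACDEA entity (e.g. structure) list
# every target they contribute to.
DAQA_TO_ACDEA = {
    "person": ["person"],
    "structure": ["work", "place"],
    "event": ["event"],
    "award": ["recognition"],
    "place": ["place"],
    "government": ["organisation"],
    "organisation": ["organisation"],
    "firm": ["organisation"],
    "publication": ["resource"],
    "article": ["resource"],
    "interview": ["resource"],
    "media": ["resource"],
    "relationship": ["relationship"],
}


def target_entities(daqa_collection):
    """Return the ACDEA entities a DAQA collection is projected onto."""
    return DAQA_TO_ACDEA.get(daqa_collection, [])
```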


At the attribute level, please find the details in the notes of the DAQA data dictionary, which can be downloaded below.


For more details on the transformation and loading of DAQA data, please refer to the Jupyter notebook DAQA_Loading.ipynb.

Integration Data Report


The following chart, which was generated by the Jupyter notebook DAQA_IntegrationSummary.ipynb, illustrates the number of DAQA records before and after integration.

Analytical Examples


For examples of how to use the integrated DAQA data for analytical purposes, please refer to the Jupyter notebooks in the Data Analysis chapter of this book.
