DAQA - Preliminary analysis

This is an exploratory data analysis of data collected from DAQA, with a focus on people, education affiliation and organisations. The work presented below forms part of ACDE’s presentation at Data Futures for Architectural History and Cultural Heritage, a collaborative workshop held in November 2022 with DAQA and Curtin University Library. This analysis was followed by a more detailed analysis, which can be found on the following page, DAQA: Extended Analysis.

First, we display the number of records for each entity as per the ACDEA.

# for data mgmt
import json
import pandas as pd
import numpy as np
from collections import Counter
from datetime import datetime
import requests, gzip, io, os
import ast

# for plotting
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import seaborn as sns
from matplotlib.colors import to_rgba

# for hypothesis testing
from scipy.stats import chi2_contingency
from scipy.stats import pareto

import warnings
warnings.filterwarnings("ignore")

# provide folder_name which contains uncompressed data i.e., csv and jsonl files
# only need to change this if you have already downloaded data
# otherwise data will be fetched from google drive
folder_name = 'data/local'

def fetch_small_data_from_github(fname):
    url = f"https://raw.githubusercontent.com/acd-engine/jupyterbook/master/data/analysis/{fname}"
    response = requests.get(url)
    rawdata = response.content.decode('utf-8')
    return pd.read_csv(io.StringIO(rawdata))

def fetch_date_suffix():
    url = f"https://raw.githubusercontent.com/acd-engine/jupyterbook/master/data/analysis/date_suffix"
    response = requests.get(url)
    rawdata = response.content.decode('utf-8')
    try: return rawdata[:12]
    except: return None

def check_if_csv_exists_in_folder(filename):
    try: return pd.read_csv(os.path.join(folder_name, filename), low_memory=False)
    except: return None

def fetch_data(filetype='csv', acdedata='organization'):
    filename = f'acde_{acdedata}_{fetch_date_suffix()}.{filetype}'

    # first check if the data exists in current directory
    data_from_path = check_if_csv_exists_in_folder(filename)
    if data_from_path is not None: return data_from_path

    urls = fetch_small_data_from_github('acde_data_gdrive_urls.csv')
    sharelink = urls[urls.data == acdedata][filetype].values[0]
    url = f'https://drive.google.com/u/0/uc?id={sharelink}&export=download&confirm=yes'

    response = requests.get(url)
    decompressed_data = gzip.decompress(response.content)
    decompressed_buffer = io.StringIO(decompressed_data.decode('utf-8'))

    try:
        if filetype == 'csv': df = pd.read_csv(decompressed_buffer, low_memory=False)
        else: df = [json.loads(jl) for jl in pd.read_json(decompressed_buffer, lines=True, orient='records')[0]]
        return pd.DataFrame(df)
    except: return None 

def fetch_all_DAQA_data():
    daqa_data_dict = dict()
    for entity in ['event', 'organization', 'person', 'place', 'recognition', 'resource', 'work']:
        daqa_this_entity = fetch_data(acdedata=entity)
        daqa_data_dict[entity] = daqa_this_entity[daqa_this_entity.data_source.str.contains('DAQA')]
    return daqa_data_dict

df_daqa_dict = fetch_all_DAQA_data() # 1 min if data is already downloaded

Number of records for each entity in DAQA dataset
_________________________________________________

event: 0
organization: 967
person: 1103
place: 1939
recognition: 27
resource: 7696
work: 2203
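
The cell above builds `df_daqa_dict` but does not show how the listing was printed. A minimal sketch of one way to produce such a listing follows. It uses a small invented dictionary (`toy_dict`, a hypothetical stand-in for the real `df_daqa_dict`), so the numbers it prints are not the DAQA counts.

```python
import pandas as pd

# Stand-in for the dictionary returned by fetch_all_DAQA_data();
# the rows here are invented purely for illustration
toy_dict = {
    'person': pd.DataFrame({'data_source': ['DAQA', 'DAQA', 'DAQA']}),
    'work': pd.DataFrame({'data_source': ['DAQA', 'DAQA']}),
    'event': pd.DataFrame({'data_source': []}),
}

title = 'Number of records for each entity in DAQA dataset'

# One "entity: count" line per entity, in alphabetical order
lines = [f'{entity}: {len(df)}' for entity, df in sorted(toy_dict.items())]

print(title)
print('_' * len(title))
print()
print('\n'.join(lines))
```

With the real dictionary, replacing `toy_dict` with `df_daqa_dict` should give one line per entity.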

DAQA Persons#

There are 1103 person records in DAQA. The following bar charts show the proportion of certain characteristics of persons in DAQA. We list the main findings below:

When records with no recorded gender (null cases) are included, males make up almost two-thirds of the person records.
Excluding null cases, the male-to-female ratio is 85:15.
Out of the 1103 persons, 912 are architects, making up 83% of the total person records.
We check for an association between gender and whether a person is an architect. A chi-square test of association gives a statistically significant result (p-value of 0.02). This suggests that the two variables are not independent, and that males in DAQA are more likely than females to be recorded as architects.
Lastly, we find that 60% of the persons recorded in DAQA practised in Queensland.
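
As an illustration of the chi-square test of association mentioned above, here is a minimal sketch using `scipy.stats.chi2_contingency`. The contingency table is hypothetical (it is not the DAQA gender-by-architect table), so the resulting p-value will differ from the 0.02 reported here.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table (NOT the DAQA counts):
# rows = gender (male, female), columns = (architect, not architect)
observed = np.array([[620, 60],
                     [100, 20]])

# Returns the test statistic, p-value, degrees of freedom and the
# frequencies expected under independence
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4f}")
# A p-value below the chosen significance level (commonly 0.05) is
# evidence against independence of the two variables.
```

For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default; pass `correction=False` to disable it.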

_images/5102c280da6d9defc57e88a7fe6858b33c7907d82b14a096f71aa2d29cb8797b.png

_images/22cd0ca57ed2ad38d2f4c97766ff4ac6a62bb4362973dc915e1ef9945da613fc.png

_images/8f93fb28b56593218c45f15701debd0d259e50ff4b85caac77b8add794b03895.png

_images/9939f462aaff9d5c0b6c96f26570e1fd20bac40343f3af1dfce2040facc3b2c2.png

_images/268f017eaf12de6e88191a20d0b140fb00f6fdd3c0dbd57ae14362a463fbd522.png

Education experiences#

Next we analyse the educational affiliations of persons in DAQA. First we clean the data by removing rows with missing data and tidying inconsistent values, and then assess the summary statistics of the data.

We find that out of the 1103 persons in DAQA, only 163 (15%) have populated education fields. As there are 210 education records, most of these persons have a single education record. However, some persons have as many as four education records, e.g., Robert Riddel.
There are 39 unique education institutions in the data, with the University of Queensland being the most common with 78 occurrences.
There are 10 unique education qualifications in the data, with Diploma of Architecture being the most common with 84 occurrences.
Over two-thirds of education records are associated with a Queensland education institution (153 of 210 records).

# clean qualification data
ee_df['organization.qualification'] = np.where(ee_df['organization.qualification']
                                         .isin(['BACHELOR OF ARCHITECTURE',
                                                'BA','B.ARCH','BArch hons']),
                                         'BArch',ee_df['organization.qualification'])

ee_df['organization.qualification'] = np.where(ee_df['organization.qualification']
                                         .isin(['BA SCIENCE(BOTANY)',
                                                'BA Town Planning','BA Larch',
                                                'BA Design Studies',
                                                'BA Design','BA Design',
                                                'BAppSci','BA (?) Town Planning']),
                                         'Bachelor (Other)',ee_df['organization.qualification'])

ee_df['organization.qualification'] = np.where(ee_df['organization.qualification']
                                         .isin(['Diploma','DIPLOMA',
                                                'DIP Arch','DIPLOMA OF ARCHITECTURE',
                                                'DipA']),
                                         'DipArch',ee_df['organization.qualification'])

ee_df['organization.qualification'] = np.where(ee_df['organization.qualification']
                                         .isin(['GradDip Uni Planning',
                                                'GradDip Project Management',
                                                'Grad Dip Town Planning',
                                                'Gdip Urban Planning',
                                                'GradDip landscape','Grad Dip Landscape']),
                                         'GDip (Other)',ee_df['organization.qualification'])

ee_df['organization.qualification'] = np.where(ee_df['organization.qualification']
                                         .isin(['Grad Dip']),
                                         'GDipArch',ee_df['organization.qualification'])

ee_df['organization.qualification'] = np.where(ee_df['organization.qualification']
                                         .isin(['M.A.ARCHITECTURE']),
                                         'MArch',ee_df['organization.qualification'])


ee_df['organization.qualification'] = np.where(ee_df['organization.qualification']
                                         .isin(['MA Town Planning','MBA',
                                                'MA App Sc','MA Education',
                                                'MArts','MArts (Urban design)',
                                                'Masters in Art','M.Litt',
                                                'MA Urban Studies']),
                                         'Masters (Other)',ee_df['organization.qualification'])

ee_df['organization.qualification'] = np.where(ee_df['organization.qualification']
                                         .isin(['PhD (hon)','PHD',]),
                                         'PhD',ee_df['organization.qualification'])

ee_df['organization.qualification'] = np.where(ee_df['organization.qualification']
                                         .isin(['CERT','certificate',
                                                'CERTIFICATE OF ARCHITECTURE']),
                                         'CertArch',ee_df['organization.qualification'])

ee_df['organization.qualification'] = np.where(ee_df['organization.qualification']
                                         .isin(['CERTIFICATE OF TOWN PLANNING',
                                                'Cert Art&Design']),
                                         'Cert (Other)',ee_df['organization.qualification'])

# keep the ten most frequent qualification labels
# (reading them from the value_counts index avoids relying on the column
# names produced by reset_index, which differ between pandas versions)
topquals = ee_df['organization.qualification'].value_counts().head(10).index.values

ee_df['organization.qualification'] = np.where(~ee_df['organization.qualification'].isin(topquals), 
                                'Unknown',ee_df['organization.qualification'])

ee_df['organization.qualification2'] = ee_df['organization.qualification']
ee_df['organization.qualification2'] = np.where(ee_df['organization.qualification'].str.contains('B'),
                                          'Bachelor',ee_df['organization.qualification2'])
ee_df['organization.qualification2'] = np.where(ee_df['organization.qualification'].str.contains('M'),
                                          'Masters',ee_df['organization.qualification2'])
ee_df['organization.qualification2'] = np.where(ee_df['organization.qualification'].str.contains('GDip'),
                                          'Graduate Diploma',ee_df['organization.qualification2'])
ee_df['organization.qualification2'] = np.where(ee_df['organization.qualification'].str.contains('Cert'),
                                          'Certificate',ee_df['organization.qualification2'])
ee_df['organization.qualification2'] = np.where(ee_df['organization.qualification']=='DipArch',
                                          'Diploma',ee_df['organization.qualification2'])
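
The repeated `np.where(... .isin([...]), ...)` pattern above can also be expressed as a single lookup table. The following is a minimal sketch of that alternative, not the code used for this analysis: `raw` is a toy Series standing in for `ee_df['organization.qualification']`, and `QUAL_MAP` is a hypothetical name covering only a handful of the raw values.

```python
import pandas as pd

# Toy stand-in for ee_df['organization.qualification']
raw = pd.Series(['B.ARCH', 'DIPLOMA', 'PHD', 'MBA', 'DipArch'])

# Hypothetical lookup table: raw value -> cleaned label (subset only)
QUAL_MAP = {
    'B.ARCH': 'BArch',
    'BACHELOR OF ARCHITECTURE': 'BArch',
    'DIPLOMA': 'DipArch',
    'DIP Arch': 'DipArch',
    'PHD': 'PhD',
    'MBA': 'Masters (Other)',
}

# Exact-match replacement; values not in the mapping are left unchanged
cleaned = raw.replace(QUAL_MAP)
print(cleaned.tolist())
```

Keeping the mapping in one dictionary makes it easier to spot duplicated or conflicting rules than a long sequence of separate assignments.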

# clean place/school data
ee_df['coverage_range.place'] = np.where(((ee_df['coverage_range.place'] == 'VIC') &
                                         (ee_df['organization.name'] == 'UQ')) |
                                         ((ee_df['coverage_range.place'] == 'MELBOURNE') &
                                         (ee_df['organization.name'] == 'UQ')) |
                                         (ee_df['coverage_range.place'] == 'Qld') |
                                         (ee_df['coverage_range.place'] == 'Qld') |
                                         (ee_df['coverage_range.place'] == 'qld') |
                                         (ee_df['coverage_range.place'] == 'UNI'),
                                         'QLD',ee_df['coverage_range.place'])

ee_df['coverage_range.place'] = np.where((ee_df['coverage_range.place'] == 'MELBOURNE'),
                                         'VIC',ee_df['coverage_range.place'])

ee_df['coverage_range.place'] = np.where((ee_df['coverage_range.place'] == 'SYDNEY') |
                                         (ee_df['coverage_range.place'] == 'Colle'),
                                         'NSW',ee_df['coverage_range.place'])
                                          
ee_df['coverage_range.place'] = np.where((ee_df['coverage_range.place'] == 'USA') |
                                         (ee_df['coverage_range.place'] == 'New York') |
                                         (ee_df['coverage_range.place'] == 'CANADA'), 
                                'USA',ee_df['coverage_range.place'])

ee_df['coverage_range.place'] = np.where((ee_df['coverage_range.place'] == 'AUCKLAND'), 
                                'NZ',ee_df['coverage_range.place'])

ee_df['coverage_range.place'] = np.where((ee_df['coverage_range.place'] == 'Scotland'), 
                                'SCOTLAND',ee_df['coverage_range.place'])

ee_df['coverage_range.place'] = np.where((ee_df['coverage_range.place'] == 'LONDON') | 
                                         (ee_df['coverage_range.place'] == 'SCOTLAND') |
                                         (ee_df['coverage_range.place'] == 'Indep') |
                                         (ee_df['coverage_range.place'] == 'England'), 
                                'UK',ee_df['coverage_range.place'])

ee_df['coverage_range.place'] = np.where((ee_df['coverage_range.place'] == 'Vienna, Austria') |
                                         (ee_df['coverage_range.place'] == 'VIENNA') |
                                         (ee_df['coverage_range.place'] == 'Vienna') |
                                         (ee_df['coverage_range.place'] == 'Vienna Austria') |
                                         (ee_df['coverage_range.place'] == 'Vienna, AUSTRIA'), 
                                         'Other (Europe)',ee_df['coverage_range.place'])

ee_df['coverage_range.place'] = np.where((ee_df['coverage_range.place'] == 'Norway') |
                                         (ee_df['coverage_range.place'] == 'MILAN') |
                                         (ee_df['coverage_range.place'] == 'Rome') |
                                         (ee_df['coverage_range.place'] == 'SLOVAKIA') |
                                         (ee_df['coverage_range.place'] == 'Hungary') |
                                         (ee_df['coverage_range.place'] == 'Cech') |
                                         (ee_df['coverage_range.place'] == 'GERMANY'), 
                                         'Other (Europe)',ee_df['coverage_range.place'])

ee_df['coverage_range.place'] = np.where((ee_df['coverage_range.place'] == 'GUADALAJARA') |
                                         (ee_df['coverage_range.place'] == 'SOUTH AFRICA') |
                                         (ee_df['coverage_range.place'] == 'INDIA'), 
                                         'Other (Rest of World)',ee_df['coverage_range.place'])

ee_df['organization.name'] = np.where((ee_df['organization.name'] == 'UoM'), #data entry error
                                'UQ',ee_df['organization.name'])

ee_df['organization.name'] = np.where((ee_df['organization.name'] == 'BCTC/UQ') |
                                (ee_df['organization.name'] == 'BCTC?UQ') |
                                (ee_df['organization.name'] == 'CTC') |
                                (ee_df['organization.name'] == 'BRISBANE CENTRAL TECHNICAL COLLEGE'), 
                                'BCTC',ee_df['organization.name'])

ee_df['organization.name'] = np.where((ee_df['organization.name'] == 'Sydney') |
                                (ee_df['organization.name'] == 'SYDNEY UNI') |
                                (ee_df['organization.name'] == 'Sydney University'), 
                                'USYD',ee_df['organization.name'])

ee_df['organization.name'] = np.where((ee_df['organization.name'] == 'Melb') |
                                (ee_df['organization.name'] == 'Melbourne') |
                                (ee_df['organization.name'] == 'UNI OF MELBOURNE') |
                                (ee_df['organization.name'] == 'MELBOURNE'), 
                                'UoM',ee_df['organization.name'])

ee_df['coverage_range.place'] = np.where((ee_df['organization.name'] == 'UoM'), 
                                'VIC',ee_df['coverage_range.place'])

ee_df['organization.name'] = np.where((ee_df['organization.name'] == 'SYDNEY TECHNICAL COLLEGE'), 
                                'STC',ee_df['organization.name'])

ee_df['organization.name'] = np.where((ee_df['organization.name'] == 'EAST SYDNEY TECH COLLEGE AND TOWN PLANNING'), 
                                'ESTC',ee_df['organization.name'])

ee_df['organization.name'] = np.where((ee_df['organization.name'] == 'Melb TC'), 
                                'MTC',ee_df['organization.name'])

ee_df['organization.name'] = np.where((ee_df['organization.name'] == 'HAVARD GRAD SCH OF DESIGN') |
                                (ee_df['organization.name'] == 'HAVARD UNI'), 
                                'Harvard',ee_df['organization.name'])

ee_df['organization.name'] = np.where((ee_df['organization.name'] == 'articled') |
                                (ee_df['organization.name'] == 'articles') | 
                                (ee_df['organization.name'] == 'Articled Pup') |
                                (ee_df['organization.name'] == 'Articled'), 
                                'Articled',ee_df['organization.name'])

# Summary statistics
# display(HTML(ee_df.describe().to_html()))
print('Summary statistics:')
display(ee_df.drop(['organization.type','coverage_range.date_range.date_end.year','organization.qualification2'],axis=1).describe())

Summary statistics:

        organization.name  organization.qualification  coverage_range.place   display_name
count                 210                         210                   210            210
unique                 39                          10                     8            163
top                    UQ                     DipArch                   QLD  Robert Riddel
freq                   78                          84                   153              4

How many qualifications do persons in DAQA hold?#

Upon further inspection, we find that most persons with education records in DAQA have one qualification (79%), with 15% having two qualifications and 6% having three or more. Below we list the persons with three or more qualifications. Six of these ten persons hold a PhD.

Person records with three education qualifications:

Balwant Saini (BArch, BArch, PhD)
Blair Wilson (BArch, DipArch, Graduate Diploma)
Gordon Holden (DipArch, Masters, PhD)
Graham de Gruchy (BArch, Masters, PhD)
Janet Conrad (BArch, Bachelor, Masters)
Karl Langer (CertArch, DipArch, PhD)
Peter Skinner (Bachelor, BArch, MArch)
Steven Szokolay (BArch, MArch, PhD)

Person records with four education qualifications:

Barbara van den Broek (DipArch, Graduate Diploma, Graduate Diploma, Masters)
Robert Riddel (DipArch, Masters, DipArch, PhD)
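
The breakdown above can be derived by counting education records per person. A minimal sketch follows, assuming one row per qualification; the `records` frame and its person labels are invented for illustration and are not DAQA data.

```python
import pandas as pd

# Hypothetical education records (one row per qualification), not DAQA data
records = pd.DataFrame({
    'display_name': ['Person A', 'Person B', 'Person B',
                     'Person C', 'Person C', 'Person C', 'Person D'],
    'qualification': ['DipArch', 'BArch', 'PhD',
                      'BArch', 'MArch', 'PhD', 'DipArch'],
})

# Number of education records held by each person
per_person = records.groupby('display_name').size()

# Collapse counts of three or more into a single bucket
buckets = per_person.clip(upper=3).map({1: '1', 2: '2', 3: '3+'})

# Share of persons in each bucket
shares = buckets.value_counts(normalize=True).sort_index()
print(shares)

# Persons with three or more qualifications, with their qualifications listed
multi = per_person[per_person >= 3].index
listing = (records[records['display_name'].isin(multi)]
           .groupby('display_name')['qualification']
           .agg(', '.join))
print(listing)
```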

_images/5c9c349880ca2ee3de49dc8e92bf036ebf33a0dbe9551315d71c7f726cf0ee7d.png

Education qualification types#

First, we plot the number of education records for each qualification type. We then aggregate the education qualifications into six broad categories. Diploma is the most common category (34% of records), followed by Bachelor (29% of records). Next, we inspect the distribution of the location of the qualifications. Beyond Queensland, other locations with a high number of education records include New South Wales, Victoria, the United Kingdom and the United States. The last bar chart shows the number of education records by education institution. We only show the top eight institutions, with the University of Queensland being the most common.

Note that BCTC, QIT and QUT refer to the same institution, which has changed names over time. AA refers to the Architectural Association School of Architecture (based in the UK) and STC refers to the Sydney Technical College.

_images/1285e0f65b8cbbb41b43db2a54954af0c5b64169bdb0257cff431acc587657ba.png

_images/debd27ee722679eb51ee5aef3345a0ac508909f51fe5e15665b23cb198b09383.png

_images/12653288d9252d0db999b9e13956118b4c0b79233a0259437deaf661ff3352f2.png

_images/2f7ab4cf56b9260a5d42ba07c10d7ef3f9fa579ca054adc7445411864bb55a46.png

Education over time#

We continue the analysis by visualising education records with respect to year. The histogram below highlights that most education records are from the 1960s and 1970s. We also plot the number of education records by year and education qualification.
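
To make the decade-level reading of the histogram concrete, here is a minimal sketch of binning years into decades. The `years` Series is hypothetical; in the notebook the equivalent values would presumably come from the `coverage_range.date_range.date_end.year` column, which is an assumption.

```python
import pandas as pd

# Hypothetical completion years, standing in for the
# 'coverage_range.date_range.date_end.year' column
years = pd.Series([1938, 1952, 1961, 1964, 1968, 1971, 1975, 1979, 1983])

# Floor each year to the start of its decade, e.g. 1964 -> 1960
decades = (years // 10) * 10

# Count records per decade in chronological order
per_decade = decades.value_counts().sort_index()
print(per_decade)
```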

_images/0b1f6885f3fbe852988c9e72fdf30e74af9dec66f1e5641338711ccfc897e586.png

_images/f8685b89c3d1ac2d24f69213b7397ba1d4918a6574691ffbe06091686919ce26.png

_images/8122d04087f83d11391dd48e3b911e8d243670741d8dbf0be6a35530ce87ac28.png

Education over time (only Queensland)#

Next we focus solely on education records from Queensland. A shift in the adoption of education qualification type is observed around the 1960s. This shift is particularly emphasised in the second visualisation where proportions of Bachelor and Diploma records are plotted. For further context, we also provide visualisations with respect to education institution and education qualification.
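
The proportions of Bachelor and Diploma records per period can be computed with a row-normalised cross-tabulation. The sketch below uses a small invented frame (`qld`), so the shares it prints are illustrative only and do not reproduce the figures.

```python
import pandas as pd

# Hypothetical Queensland records (decade, broad qualification type)
qld = pd.DataFrame({
    'decade': [1950, 1950, 1950, 1960, 1960, 1970, 1970, 1970],
    'qualification': ['Diploma', 'Diploma', 'Bachelor', 'Diploma',
                      'Bachelor', 'Bachelor', 'Bachelor', 'Diploma'],
})

# Row-normalised cross-tabulation: each decade's shares sum to 1
shares = pd.crosstab(qld['decade'], qld['qualification'], normalize='index')
print(shares.round(2))
```

A shift from one qualification type to another would show up as the two columns crossing over between consecutive decades.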

_images/a6c4358fe4904813a811a380c2dfbca4090bbbf2ba3ac6134dc5e9ac0dcbaf10.png

_images/f949acc9c36c522a95ace02e302b8a7407023424eb8a2b517516c4e61b5dee12.png

_images/1e8446ac72265e802ddcfd7bec1a29c6f1eae1a749077203e08f7ff5dcffc29e.png

_images/544dcda89956b035ada71f18c35d862727dc9c6b53ac3b2f4f05f15880c82f51.png

DAQA Organisations#

There are 967 organisation records in DAQA. The bar chart below shows the proportion of organisations by type, with the large majority being architectural firms (907 of 967, about 94%). We also plot the number of person-organisation relationships by type. We find that three-quarters of these relationships are employment-related.

            "firm"  "education"  "organisation"  "government"
_class_ori     907           39              15             6 
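
As a quick check on the proportions, the counts in the table above can be converted to percentages. This is a small sketch that re-enters the four counts by hand rather than reading them from the dataset.

```python
import pandas as pd

# Counts copied from the table above
org_counts = pd.Series({'firm': 907, 'education': 39,
                        'organisation': 15, 'government': 6})

# Total number of organisation records
total = org_counts.sum()

# Percentage share of each organisation type, to one decimal place
shares = (org_counts / total * 100).round(1)

print(total)
print(shares)
```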

_images/1ddde2c2d17b0db76d6378985738f116d8fc33cd3e21299a16c4b7bf8b3687bc.png

There are 1575 person-organisation relation records.

_images/6a9a1388c202753514673d307ac45e8fda3e19094bfe96e7712cfbfb98b69e25.png
