The workshops will take place on Tuesday 4 June in the Monseigneur Sencie Institute (MSI) building. Please consult the descriptions below to find where your workshop will be located in this building. The registration desk will be located at the main entrance of the building. From there, the organizers and volunteers will point you towards your workshop room.
Full Day Workshop - Transkribus
Workshop location: MSI 01.20
Workshop conveners: Annemieke Romein and Bram Jacobs
9:00 – 12:00
Basic introduction to the Transkribus platform (no prior knowledge expected)
12:00 – 13:30
Self-catered lunch
13:30 – 16:30
Advanced use of the Transkribus platform (prior knowledge of Transkribus preferred)
Recognising handwritten texts through palaeography is a time-consuming task. Historical handwritten texts can be a challenge to read for untrained eyes, and even for those who read them frequently, it takes a lot of time. With the EU-funded tranScriptorium project (2013–2015) and the READ project (Recognition and Enrichment of Archival Documents, 2016–2019), a consortium worked on the creation of a tool, later named Transkribus, to automatically recognise handwritten texts. ‘Making digitised and digital sources available is increasingly becoming a core element in many research projects’ (Romein et al. 2020, p. 294). Artificial Intelligence (AI) has played, and continues to play, a significant role in ‘unlocking the past’, as Transkribus’ slogan now reads.
Transkribus is one of the tools that can be trained to do handwritten text recognition (HTR). Thanks to its 175,000+ users, it has gathered a massive amount of training material on its servers, which allows the AI to ‘learn’ and ‘adapt’ to challenging handwriting and has drastically decreased the amount of training data needed to create a model. While in 2018 one needed around 60–80 pages of training material, the number had dropped to 30–50 pages (10,000 words) by the end of 2022. With the emergence of more base models, the minimum number of words can even be reduced to 6,000; expectations are that with the Transformer HTR models the number might be lower still.
Transkribus is, for most users, intuitive and straightforward and, very importantly, easy to use in their own web browser. It is therefore well suited for use in classes and workshops, and easy to explain to students and volunteers alike.
Both workshop leaders are active users of Transkribus. Annemieke Romein is a researcher and was elected honorary community director of READ-COOP in 2023. She has been actively giving workshops and webinars since 2019. Bram Jacobs is affiliated with Transkribus (as account manager for the Benelux) and uses Transkribus for a family project. During the training, examples from various projects in which they are involved will give practical insights into the use of Transkribus in daily research life. Transkribus will be presented from the users’ point of view to ease its use; the account manager will be present to answer specific questions about the practical and financial implications of Transkribus for your (future) projects.
This workshop consists of two parts. You can attend both parts or just one of them; please have a thorough look at the programme to see what fits your needs. The workshop will be hands-on and data-driven. The organisers warmly invite participants to bring their own images, scans or photos of archival or library materials; this is not limited to handwritten material, as printed texts are welcome too. However, if you do not have source material available or are not (yet) at liberty to upload it to the servers in Innsbruck, the organisers can provide you with some material to practise on.
More details will be communicated to the registered participants closer to the date of the workshop.
Full Day Workshop - CLARIAH-VL
Workshop location: MSI 02.23
Workshop conveners: Lamyk Bekius, Julie Birkholz, Fien Danniau, Rein Debrulle, Lise Foket, Tom Gheldof, Frederic Lamsens, Vincent Neyt, Frederic Pietowski, Nooshin Shahidzadeh Asadi, Annamaria Van Ingelgem and Christophe Verbruggen
9:00 – 12:00
Named entity referencing and digital scholarly editing
12:00 – 13:30
Self-catered lunch
13:30 – 16:30
IIIF platform Madoc
In this in-person workshop we will give hands-on demonstrations of a number of tools developed within CLARIAH-VL, an FWO infrastructure project. CLARIAH-VL is the Flemish contribution to the European research infrastructures DARIAH and CLARIN. The infrastructure brings together 22 research teams representing a range of disciplines from the universities of Ghent, Antwerp, Leuven and Brussels and the Dutch Language Institute. CLARIAH-VL supports the highly diverse and multilingual composition of digital humanities data inherent in European long-term history, culture, environment and society. We work to achieve this by facilitating and (semi-)automating as many aspects of the workflows of humanities researchers as possible, allowing researchers to take full advantage of the most recent advances in machine learning, linked data and semantic technologies, especially with regard to digital text and image analysis.
The full-day workshop will be organised in multiple parts with a session for each tool or service. Attendees will need their own laptop.
In this workshop we will present three tools: a named entity referencing toolkit, a tool for digital scholarly editing, and a participatory digital asset enrichment platform based on IIIF technology (listed in detail below).
Named Entity Referencing Toolkit
Tom Gheldof has been the coordinator of CLARIAH-VL at KU Leuven’s Faculty of Arts since 2019. He is currently responsible for the Work Package on Linked Open Data and (as an MA in Ancient History) works with several kinds of named entities, such as historical place names (in gazetteers), persons and time.
Frédéric Pietowski is a developer at KU Leuven. After completing his advanced Master’s degree in Digital Humanities in 2017, he became a member of the KU Leuven team. Initially, his work focused on several pilot projects, which have since evolved to become integral components of the Named Entity Referencing pipeline.
Toolkit for Digital Scholarly Editing
Lamyk Bekius is the University of Antwerp’s coordinator of the CLARIAH-VL. In 2023, she obtained her PhD at the University of Amsterdam and the University of Antwerp on the thesis ‘Behind the computer screens: the use of keystroke logging for genetic criticism applied to born-digital works of literature’, for which she also worked at the Huygens Institute (KNAW). Her research focuses on genetic criticism, born-digital literary archives, and keystroke logging.
Vincent Neyt is a researcher at the Centre for Manuscript Genetics at the University of Antwerp. He specialises in digital scholarly editions and is the technical developer of the Beckett Digital Manuscript Project as well as of the eXtant Toolkit for Digital Scholarly Editing (CLARIAH-VL). He is also writing a PhD on how Stephen King wrote IT.
Nooshin Shahidzadeh Asadi is a doctoral student at the University of Antwerp. She has been part of the CATCH 2020 and CLARIAH-VL projects in the course of her PhD and has been developing web-based tools to facilitate the work of digital scholarly editors who use HTR platforms and work with XML. She has a background in software engineering.
IIIF platform Madoc
Lise Foket holds an MA in History (Ghent University, 2020) and an Advanced Master in Digital Humanities (KU Leuven, 2021). She joined the team of GhentCDH in 2021 as a research collaborator for the IIIF annotation and crowdsourcing platform Madoc, the geotemporal platform for digital heritage collections Gent Gemapt, and the educational innovation project Omeka-FLWI. Her personal research interests as a historian include animal history, social history and ecological history.
Fien Danniau holds a master’s degree in History (2005) and joined the GhentCDH team in 2018. As a public historian she focuses on collection management, digital storytelling, user interaction and education in Digital Humanities.
Rein Debrulle holds an MA in History (Ghent University, 2022). He worked as a student on the ‘Gent Gemapt’ project in 2021 and joined the GhentCDH team in 2022. As a scientific collaborator, he participated in the review and management of textual and cartographic content on Gent Gemapt. His current work focuses on the IIIF standards, crowdsourcing and geospatial data.
Julie Birkholz is Assistant Professor of Digital Humanities at UGent, Lead of the Royal Library of Belgium’s Digital Research Lab, and the general coordinator of CLARIAH-VL.
Annamaria Van Ingelgem holds a master’s degree in History (Ghent University, 2023) and is currently finishing the Advanced Master in Digital Humanities (KU Leuven). Her research interests include public history, IIIF storytelling and Omeka S.
The CLARIAH-VL Named Entity Referencing Toolkit helps humanities researchers extract, model and contextualise nameable entities (e.g. places, persons, organisations) from texts, relying on the Wikidata model and identifiers. We will demonstrate the toolkit by introducing its basic functionalities (e.g. Named Entity Linking), the data model, and data interpretation and reporting, and by showing the data export and import functions. This will be done using the HIPE datasets.
eXtant offers a range of tools to assist with digital scholarly editing tasks of analogue and born-digital material. We will introduce the tools currently available in the toolkit, including the Writer’s Library App (for creating and publishing an edition of a collection of books), Diff Annotator (to enrich text-comparisons of two plain text files), Keystroke Loxensis (to visualise keystroke logging data encoded in TEI-XML), and Axolotl (a collaborative XML editor for HTR postprocessing). After introducing the different tools, we will provide a collaborative hands-on demonstration of Axolotl.
Madoc is a participatory digital asset enrichment platform based on IIIF technology. It is built on an Omeka S-based open-source platform for the display, enrichment, and curation of IIIF-based digital objects. The platform runs on open-source services and technology, bound together to provide a single management interface for IIIF collections. Through the lens of the Gent Gemapt project, we will demonstrate how you can use this collaborative IIIF platform for transcription, image segmentation, metadata enrichment or harvesting, and annotation (entity caption via rectangles or polygons).
More details will be communicated to the registered participants closer to the date of the workshop.
Morning Workshop - Making Scholarly Collections of Letters Insightful and Accessible for a Wider Audience
Workshop location: MSI 01.23
Workshop conveners: Judith Brouwer, Maria Eskevich, and Bente Frissen
9:00 – 12:00
This half-day workshop includes a presentation from an invited speaker, break-out sessions focused on topics such as data enrichment and data visualization, and a discussion moment.
This workshop invites the audience of DH Benelux 2024 to a discussion about access to and use of scholarly data and collections by wider audiences. In particular, the organisers focus on letters as the type of material.
First, Dirk van Miert (director of the Huygens Institute) introduces the topic and the context of previously carried out projects, such as CEN. Second, Judith Brouwer and Bente Frissen (WaU-programme at Huygens Institute) provide an overview of the collections available at the Huygens Institute and other organisations such as the Digital Library of Dutch Literature, the Bodleian Library and the Humanities Division of the University of Oxford, and of projects and initiatives such as EMLO, LetterSampo and CorrespSearch. Third, our keynote speaker Andrew Payne (National Archives UK) showcases how letters could be used in the context of education.
Afterwards, we discuss potential data processing that would simplify, encourage, and boost the engagement with the collections by the public in three brainstorming sessions led by Kay Pepping (Huygens Institute), Monika Barget (Maastricht University), and Maria Eskevich (WaU-programme at Huygens Institute).
The insights from the workshop will feed the follow-up work on the ‘brievenportaal’ (a portal dedicated to letters) at the Huygens Institute, which would encompass the datasets available at the institution and provide a broader perspective.
This workshop is organised in the context of the implementation of the Dutch government-wide programme Werk aan Uitvoering (WaU) at the Huygens Institute. Overall, the WaU programme aims to improve public services, and as part of this programme, the Huygens Institute will make its numerous and extensive digital information sources (online resources) on Dutch history and culture more accessible and connect them to other data collections. Outreach plans of this kind require consultations with different target users, from researchers to representatives of the general audience, as well as with potential collaborators with whom it should be possible to improve services and to align the use of state-of-the-art technologies to enrich metadata on a large scale.
On the one hand, the impact of the datasets managed by the Huygens Institute can be increased through their release in standard formats and overall improved interoperability. As these steps allow data processing with different analysis and visualisation tools, researchers can thus more easily use the datasets outside their own context, and link the data to other databases. In this way, we encourage new collaborations and research in the field and beyond.
On the other hand, datasets should always be considered in the context of the time in which they were created and annotated, thus also considering the scientific and societal framework in which the research questions of the past were posed. It is important that all those aspects that could potentially introduce some kind of bias are directly named and openly described in order to ensure well informed usage of the data. In this way, we lay the foundations for a comprehensive, open corpus of texts and data that can be explored in context. This enables a more open and inclusive approach to Dutch history, in line with the mission of the Huygens Institute.
The datasets under the management of the Huygens Institute are a reliable source of information, and therefore, constitute a valuable asset for society at large. In practice, we envisage that our work on thematic interfaces will make it easier for the general public to discover, explore and further utilise the data in new contexts.
The elements of a WaU strategy to follow will be the focus of this workshop. In particular, we would like to zoom in on the following questions: i) what can and should be done with the (meta)data using the tools and technology currently available; ii) which ways of providing insight into the collection overview and its content have proved to work best so far; iii) the output of which lines of research activity could be of most interest to which potential user group, and might thus be prioritised?
More details will be communicated to the registered participants closer to the date of the workshop.
Afternoon Workshop - Experience and Challenges with Named Entities
Workshop location: MSI 01.23
Workshop conveners: Chiara Palladino, Margherita Fantoli, Evelien de Graaf, Monica Berti, Matteo Romanello, Tariq Yousef, Marijke Beersmans, Tom Gheldof, Laura Soffiantini, and Eleonora Litta
13:30 – 16:30
The focus of the workshop is to discuss the current definition of Named Entities in premodern texts and their annotation in a Digital Humanities context, including the design of gold standards, tagsets, and best practices.
The goal of the workshop is to assess the performance of currently existing tagsets and guidelines for the annotation of Named Entities in the domain of premodern texts and languages.
Named Entity Recognition (NER) is a core operation in NLP and one of the fundamental aspects in automatic information retrieval. It is tremendously useful in many areas of literary and historical exegesis, by providing essential contextual information on people, places and other relevant entities, and by allowing large-scale types of analysis, such as social and family networks, spatial footprint of a source, geographical simulations and mapping, patterns of movement and of transmission of ideas.
Premodern sources, including Classical sources like Ancient Greek literary works, but also early modern itineraries, Medieval sources, and even historical commentaries, all lack an adequate infrastructure for the training of automatic NER models. The recent innovations in the field of transformer-based language models, such as BERT, offer an important opportunity to improve the performance of NER methods in low-resourced languages (Ehrmann et al., 2022), but these models are very data-hungry and require training and evaluation datasets in order to perform optimally.
Named Entities are a particularly complex domain because they are often difficult to define: their boundaries are not always clear, and there are additional issues with historical uncertainty, OCR and spelling variation noise (Burns 2023), nested entities (see for instance Chastang et al. 2021 and Torres Aguilar 2022), and metonymic uses (e.g. group names used as proxy for locations). Moreover, in Digital Humanities there is a substantial lack of best practices in the design of Named Entities datasets (for a recent attempt on secondary literature, cf. Romanello and Najem-Meyer 2022): current DH projects tend to adopt internal strategies that create issues with data exchange and introduce noise when it comes to training models. The lack of shared tagsets and guidelines makes the automation of NER tasks even more complicated.
Scholars who want to start annotating a corpus within their research venture might be faced with a lack of guidance on the methodological level and with the multiplicity of tools and formats available. With this workshop, we aim to provide researchers with a platform where they can get started with the annotation process or, if they are already familiar with the task, exchange with other experts on the best practices to make their data as sharable and “reusable” as possible.
This workshop aims to address these challenges by bringing together scholars with interest and expertise in premodern Named Entities. We will organize the work around a shared annotation task using INCEpTION (https://inception-project.github.io/), providing a predefined tagset designed for premodern sources. Participants will be able to use their own corpus if they wish, or to choose among a series of texts that will be provided.
In the second part of the workshop, we will organize a discussion on the application of the tagset and its generalization to different linguistic and textual domains, on issues in the recognition and annotation of Named Entities, their classification, and other common problems like entity boundary, nested entities, and so on. The task will serve as a starting point to discuss current challenges in premodern documents, and to plan for shared best practices.
The workshop is addressed to scholars who wish to learn how to annotate Named Entities in premodern texts and requires minimal familiarity with existing platforms. Participants will gain an essential overview of the topic of Named Entities and will learn a generalizable annotation workflow with a customizable tagset. Moreover, the workshop will foster collaboration among experts in premodern traditions, who will be able to assess common challenges and ways to address them.
More details will be communicated to the registered participants closer to the date of the workshop.
Afternoon Workshop - Opening up Born-digital Data for Researchers
Workshop location: MSI 02.15
Workshop conveners: Eveline Vlassenroot, Peter Mechant, Friedel Geeraert, and Yu Tao
13:30 – 16:30
This workshop aims to explore strategies, techniques, and methodologies to ensure the scientific exploitation and social valorization of born-digital archived (social media) content, thereby enabling new forms of data access. By convening experts from diverse disciplinary backgrounds, the workshop seeks to identify best practices, share insights, and facilitate new avenues for exploring archived social media. To achieve this, we will present various insights and tools, such as a framework for Web Archival Literacy in the form of a self-assessment questionnaire (an archival intelligence scale), followed by the solicitation of ideas and feedback from participants. To guide the discussions, the workshop will adopt the conceptual devices of ‘orientating,’ ‘auditing,’ and ‘constructing’ developed by Ogden & Maemura (2021). These devices, describing common research practices and associated challenges, will overlap during the workshop rather than being presented as a linear workflow or fixed set of practices.
In the ‘Orientating’ phase, dedicated to understanding the archive and its interface, emphasis will be placed on grasping the institutional collection development policies and strategies. This phase is crucial for comprehending how the archive is shaped and governed by its overarching goals and audience considerations. Particular attention will be given to exploring published collection development policies of relevant institutions, which serve as guiding frameworks for content selection. Additionally, discussions will revolve around the influence of these policies on shaping the archive’s content, and the potential implications for enhancing accessibility and engagement within the research community.
The ‘Auditing’ phase, centered on contextualizing data by tracing the history of collection practices and curation decisions, will address and discuss researcher needs in terms of accessing born-digital content.
The workshop will end with a ‘Constructing’ section (e.g., involving the selection and aggregation of data from sources across collections) in which researcher needs in terms of content enrichment and implementation for enrichment will be discussed.
The workshop is organised by research partners in the BELSPO-funded, KBR-coordinated BelgicaWeb project. This research project aims to (1) investigate how to sustainably provide access to Belgium’s born-digital collections for both the public and researchers; (2) create born-digital collections by capturing social media and web content; (3) aggregate and enrich existing (meta)data at KBR using linked open data, controlled vocabularies, Natural Language Processing and other digital methods; (4) develop the necessary data infrastructure by selecting the best (open source) technologies, sharing (open access) information and building on the best practices from international networks and infrastructures (e.g. DARIAH-EU, IIPC, WARCnet, RESAW); (5) analyze the relevant legal frameworks (e.g. data exchange, copyright in the context of text and data mining, data protection and privacy rights and freedom of expression); and (6) promote and raise awareness about Belgium’s born-digital heritage.
More details will be communicated to the registered participants closer to the date of the workshop.