Representation Learning for Arbitrarily Long Richly Formatted Multimedia Documents

Lay summary

In this project, we aim to incorporate discourse-level features for document layout, together with the rich multimodal context, to pre-train generic feature representations of multimedia documents.

This can be seen as an extension of existing pre-trained representations: we model multimedia documents produced by various typesetting software such as LaTeX, Word, HTML, etc. as graph data structures, and then combine existing pre-trained representations of text [Devlin et al. 2018], images [Girshick 2015], etc. according to the elements in the document. We rely on recent advances in graph neural networks [Wu et al. 2019] to exploit the relationships between the different elements of multimedia documents and to learn representations of multimedia documents. If successful, we also plan to explore the utility of our generic pre-trained multimedia document representation in challenging tasks such as multimedia information extraction and multimedia question answering.

Abstract

In recent years, the field of Natural Language Processing (NLP) has seen a paradigm shift with the adoption of large-scale self-supervised pre-trained representations [Dai and Le 2015, Peters et al. 2018, Devlin et al. 2018, inter alia]. These models are pre-trained as language models on large corpora of unlabeled text and can then be fine-tuned on any downstream task. Pre-trained representations have led to significant improvements in many traditional NLP tasks and are now being adopted to model text data in a variety of domains such as biomedicine [Lee et al. 2020], scientific documents [Beltagy et al. 2019] and patents [Lee and Hsiang 2019].

While the impact of pre-training in NLP is undeniable, current pre-training approaches have key shortcomings that prevent us from unleashing their true potential in the real world. In particular:

1. With increased digitization, there has been a growth in the number of long text documents such as blogs, stories, books, etc. However, existing pre-training approaches use a moving 'context window' and thus cannot be directly fine-tuned for settings that require modeling arbitrarily long text.
2. With the growing use of typesetting software, text today is often written and presented with formatting and with multiple modalities of data (text, images, tables, figures, etc.) to convey information more easily. However, pre-trained representations ignore this rich multimodal document structure, which could be helpful in many applications.

There are many challenges in pre-training representations of long multimedia documents. Pre-training Transformer-based models [Vaswani et al. 2017] such as BERT from scratch on a dataset of long documents is computationally very expensive, as the computational complexity of these models increases quadratically with the length of the document. Moreover, there are no computational studies of text layout features in formatted text and how they can be leveraged for AI problems. These features come in addition to better-studied discourse features such as syntactic arrangement or rhetorical forms, and thus it is not straightforward to extend pre-training models for text data to formatted documents.

Multimedia documents typically contain multiple modalities of data such as images, tables, figures, plots, etc. in addition to text. However, there is little work on pre-training generic multimodal feature representations, and it would be very expensive to pre-train fresh generic multimodal feature representations.

In this research, we plan to incorporate discourse-level features for document layout and rich multimodal context to pre-train generic feature representations of multimedia documents. This work can be seen as an augmentation of existing pre-trained representations: we model multimedia documents from various typesetting software such as LaTeX, Word, HTML, etc. as graph data structures and then combine existing pre-trained representations of text [Devlin et al. 2018], images [Girshick 2015], etc. corresponding to the elements in the document. We rely on recent advances in graph neural networks [Wu et al. 2019] to exploit the relationships between different elements of multimedia documents and learn representations of multimedia documents.

We have already conducted some preliminary studies in this direction. For example, our ongoing work on modeling long text documents shows promising results. In our past work [Sachan et al. 2020] on harvesting structured subject knowledge of geometry from textbooks, we have shown that formatting features can be used to improve a strong information extraction system in that domain, and that discourse and text layout features provide information complementary to the lexical semantic information commonly used for information extraction.

If successful, we plan to also explore the utility of our generic pre-trained multimedia document representation in challenging tasks such as multimedia information extraction and multimedia question answering.
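
The idea of treating a formatted document as a graph whose nodes carry pre-trained element representations can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the project's actual design: the `Element` and `DocumentGraph` classes, the two-dimensional toy vectors standing in for pre-trained text or image embeddings, and the choice of a single round of neighbour averaging as a stand-in for a graph neural network layer.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Element:
    """One document element (hypothetical schema)."""
    kind: str               # e.g. "section", "paragraph", "figure"
    embedding: List[float]  # placeholder for a pre-trained text/image vector


@dataclass
class DocumentGraph:
    """Document elements as nodes, layout relations as undirected edges."""
    nodes: List[Element] = field(default_factory=list)
    edges: List[Tuple[int, int]] = field(default_factory=list)

    def add_node(self, kind: str, embedding: List[float]) -> int:
        self.nodes.append(Element(kind, list(embedding)))
        return len(self.nodes) - 1

    def add_edge(self, a: int, b: int) -> None:
        self.edges.append((a, b))

    def neighbours(self, i: int) -> List[int]:
        out = []
        for a, b in self.edges:
            if a == i:
                out.append(b)
            elif b == i:
                out.append(a)
        return out


def propagate(graph: DocumentGraph) -> List[List[float]]:
    """One round of mean aggregation: each node's new vector is the
    average of its own vector and its neighbours' vectors."""
    updated = []
    for i, node in enumerate(graph.nodes):
        group = [node.embedding]
        group += [graph.nodes[j].embedding for j in graph.neighbours(i)]
        dim = len(node.embedding)
        updated.append([sum(v[k] for v in group) / len(group) for k in range(dim)])
    return updated


def document_vector(vectors: List[List[float]]) -> List[float]:
    """Mean-pool the node vectors into one document-level vector."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


# Toy document: a section containing one paragraph and one figure.
doc = DocumentGraph()
sec = doc.add_node("section", [1.0, 0.0])
par = doc.add_node("paragraph", [0.0, 1.0])
fig = doc.add_node("figure", [1.0, 1.0])
doc.add_edge(sec, par)
doc.add_edge(sec, fig)

vectors = propagate(doc)
doc_vec = document_vector(vectors)
```

After one round, the paragraph and figure vectors have each absorbed information from the section they belong to, and the section vector reflects both of its children. A learned graph neural network would replace the fixed averaging with trainable transformations, but the data flow would be similar.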

Last updated: 01.10.2021

SNSF
Project funding (Div. I-III)
Original data source 201009

Information Technology
Mathematics, Natural- and Engineering Sciences; Engineering Sciences

1 People

Prof. Mrinmaya Sachan
