Conference papers

Fusion d'espaces de représentations multimodaux pour la reconnaissance du rôle du locuteur dans des documents télévisuels [Fusion of multimodal representation spaces for speaker role recognition in television broadcasts]

Abstract: Person role recognition in video broadcasts consists of classifying people into roles such as anchor, journalist, or guest. Existing approaches mostly consider a single modality, either audio (speaker role recognition) or image (shot role recognition), first because the two modalities are not synchronized, and second because no video corpus is annotated in both modalities. Deep Neural Network (DNN) approaches can learn feature representations (embeddings) and classification functions simultaneously. This paper presents a multimodal fusion of audio, text, and image embedding spaces for speaker role recognition in asynchronous data. Monomodal embeddings are trained on exogenous data and fine-tuned with a DNN on a 70-hour corpus of French broadcasts for the target task. Experiments on the REPERE corpus show the benefit of embedding-level fusion over both the monomodal embedding systems and the standard late-fusion method.
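The embedding-level fusion described in the abstract can be sketched as follows. This is an illustrative assumption of the general technique, not the paper's actual architecture: the embedding dimensions, the role labels, and the toy linear classifier standing in for the fine-tuned DNN are all hypothetical.

```python
import numpy as np

def fuse_embeddings(audio_emb, text_emb, image_emb):
    """Embedding-level fusion: concatenate the monomodal embeddings
    into one joint representation before classification."""
    return np.concatenate([audio_emb, text_emb, image_emb])

def classify_role(fused, weights, bias, roles):
    """Toy linear classifier with softmax over the fused embedding;
    the paper instead fine-tunes a DNN on the fused representation."""
    logits = fused @ weights + bias
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return roles[int(np.argmax(probs))], probs

# Hypothetical embedding sizes for the three modalities
rng = np.random.default_rng(0)
audio = rng.normal(size=128)   # e.g. speaker embedding
text = rng.normal(size=100)    # e.g. transcript embedding
image = rng.normal(size=256)   # e.g. face/shot embedding
fused = fuse_embeddings(audio, text, image)  # shape (484,)

roles = ["anchor", "journalist", "guest"]
W = rng.normal(size=(fused.size, len(roles)))
b = np.zeros(len(roles))
role, probs = classify_role(fused, W, b, roles)
```

By contrast, the late-fusion baseline mentioned in the abstract would run one classifier per modality and combine the three score vectors afterwards, so the classifier never sees cross-modal feature interactions.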

Cited literature [18 references]

https://hal-amu.archives-ouvertes.fr/hal-01454928
Contributor: Benoit Favre
Submitted on: Friday, November 6, 2020 - 11:48:45 AM
Last modification on: Thursday, February 11, 2021 - 8:20:08 AM
Long-term archiving on: Sunday, February 7, 2021 - 6:47:52 PM

File

J62.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01454928, version 1

Citation

Sebastien Delecraz, Frédéric Béchet, Benoit Favre, Mickael Rouvier. Fusion d'espaces de représentations multimodaux pour la reconnaissance du rôle du locuteur dans des documents télévisuels. Actes de la conférence JEP 2016, Jul 2016, Paris, France. ⟨hal-01454928⟩
