WO2023057850 - VISUAL SPEECH RECOGNITION BASED ON CONNECTIONIST TEMPORAL CLASSIFICATION LOSS
National phase entry:
Publication Number
WO/2023/057850
Publication Date
13.04.2023
International Application No.
PCT/IB2022/058984
International Filing Date
22.09.2022
Title
VISUAL SPEECH RECOGNITION BASED ON CONNECTIONIST TEMPORAL CLASSIFICATION LOSS
Applicants
SONY GROUP CORPORATION
1-7-1 KONAN
MINATO-KU, Tokyo 108-0075, JP
Inventors
JIN, Shiwei
c/o SONY CORPORATION OF AMERICA
16535 VIA ESPRILLO, MZ 1029
San Diego, California 92127, US
LEE, JONG HWA
c/o SONY CORPORATION OF AMERICA
16535 VIA ESPRILLO, MZ 1029
San Diego, California 92127, US
WNUK, Matthew
c/o SONY CORPORATION OF AMERICA
16535 VIA ESPRILLO, MZ 1029
San Diego, California 92127, US
COSTELA, Francisco
c/o SONY CORPORATION OF AMERICA
16535 VIA ESPRILLO, MZ 1029
San Diego, California 92127, US
Priority Data
63/262,049
04.10.2021
US
17/689,270
08.03.2022
US
Application details
| Detail | Value |
|---|---|
| Total Number of Claims/PCT | * |
| Number of Independent Claims | * |
| Number of Priorities | * |
| Number of Multi-Dependent Claims | * |
| Number of Drawings | * |
| Pages for Publication | * |
| Number of Pages with Drawings | * |
| Pages of Specification | * |
| International Searching Authority | EPO |
| Applicant's Legal Status | Legal Entity |
| Entry into National Phase under | Chapter I |
| Translation | * |
* The data is based on automatic recognition. Please verify and amend if necessary.
Quotation for National Phase entry

| Country | Stages | Total (USD) |
|---|---|---|
| China | Filing | 1350 |
| EPO | Filing, Examination | 6343 |
| Japan | Filing | 590 |
| South Korea | Filing | 607 |
| USA | Filing, Examination | 2710 |

Total: 11600 USD

The term for entry into the National Phase has expired. This quotation is for informational purposes only.
Abstract
An electronic apparatus and method for visual speech recognition based on connectionist temporal classification (CTC) loss are disclosed. The electronic apparatus receives a video that includes human speakers and generates a prediction corresponding to the speakers' lip movements. The prediction is generated by applying a Deep Neural Network (DNN), trained using a CTC loss function, to the video. Based on the prediction, the electronic apparatus detects word boundaries in a sequence of characters that corresponds to the lip movements and divides the video into a sequence of video clips based on the detection, such that each video clip corresponds to a word spoken by the human speakers. The electronic apparatus generates a sequence of word predictions by processing the sequence of video clips, and generates a sentence or phrase based on the generated sequence of word predictions.
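The pipeline the abstract describes can be sketched roughly in code: greedy CTC decoding collapses the DNN's per-frame character predictions into a character sequence, and decoded spaces then serve as word boundaries for splitting the frame range into per-word clips. This is an illustrative sketch only, not the patent's actual implementation; the function names, the blank symbol, and the frame bookkeeping are all assumptions.

```python
BLANK = "-"  # assumed CTC blank symbol

def ctc_greedy_decode(frame_chars):
    """Standard CTC greedy decoding: collapse repeats, then drop blanks.

    frame_chars: the most likely character for each video frame (e.g. the
    argmax over the DNN's per-frame output distribution).
    Returns the decoded string and, for each decoded character, the frame
    index that emitted it.
    """
    decoded, frame_of_char = [], []
    prev = None
    for t, c in enumerate(frame_chars):
        if c != prev and c != BLANK:
            decoded.append(c)
            frame_of_char.append(t)
        prev = c
    return "".join(decoded), frame_of_char

def word_clips(decoded, frame_of_char, num_frames):
    """Split the frame range into per-word (start, end) clips at spaces.

    Each clip starts where the previous one ended, so the clips partition
    the full frame range; the last clip runs to the end of the video.
    """
    clips = []
    words = decoded.split(" ")
    char_idx, start = 0, 0
    for w in words:
        end_char = char_idx + len(w) - 1
        end = frame_of_char[end_char]  # frame of the word's last character
        clips.append((start, end + 1))
        char_idx = end_char + 2        # skip over the separating space
        start = clips[-1][1]
    clips[-1] = (clips[-1][0], num_frames)
    return words, clips
```

For example, per-frame predictions `["h","h","-","i","-"," ","t","h","e","r","e","-"]` decode to `"hi there"`, and the 12-frame video is then split into one clip per word, each of which could be fed to a per-word recognition stage as in the abstract.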