WO2023057850 - VISUAL SPEECH RECOGNITION BASED ON CONNECTIONIST TEMPORAL CLASSIFICATION LOSS
National phase entry:
Publication Number
WO/2023/057850
Publication Date
13.04.2023
International Application No.
PCT/IB2022/058984
International Filing Date
22.09.2022
Title
VISUAL SPEECH RECOGNITION BASED ON CONNECTIONIST TEMPORAL CLASSIFICATION LOSS
Applicants
SONY GROUP CORPORATION
1-7-1 KONAN
MINATO-KU, Tokyo 108-0075, JP
Inventors
JIN, Shiwei
c/o SONY CORPORATION OF AMERICA
16535 VIA ESPRILLO, MZ 1029
San Diego, California 92127, US
LEE, JONG HWA
c/o SONY CORPORATION OF AMERICA
16535 VIA ESPRILLO, MZ 1029
San Diego, California 92127, US
WNUK, Matthew
c/o SONY CORPORATION OF AMERICA
16535 VIA ESPRILLO, MZ 1029
San Diego, California 92127, US
COSTELA, Francisco
c/o SONY CORPORATION OF AMERICA
16535 VIA ESPRILLO, MZ 1029
San Diego, California 92127, US
Priority Data
63/262,049
04.10.2021
US
17/689,270
08.03.2022
US
Application details
| Detail | Value |
|---|---|
| Total Number of Claims/PCT | * |
| Number of Independent Claims | * |
| Number of Priorities | * |
| Number of Multi-Dependent Claims | * |
| Number of Drawings | * |
| Pages for Publication | * |
| Number of Pages with Drawings | * |
| Pages of Specification | * |
| International Searching Authority | EPO |
| Applicant's Legal Status | Legal Entity |
| Entry into National Phase under | Chapter I |
| Translation | * |
* The data is based on automatic recognition. Please verify and amend if necessary.
Quotation for National Phase entry

| Country | Stages | Total (USD) |
|---|---|---|
| China | Filing | 1350 |
| EPO | Filing, Examination | 6343 |
| Japan | Filing | 590 |
| South Korea | Filing | 607 |
| USA | Filing, Examination | 2710 |

Total: 11600 USD

The term for entry into the National Phase has expired. This quotation is for informational purposes only.
Abstract
An electronic apparatus and method for visual speech recognition based on connectionist temporal classification (CTC) loss are disclosed. The electronic apparatus receives a video that includes human speakers and generates a prediction corresponding to the speakers' lip movements. The prediction is generated by applying a Deep Neural Network (DNN), trained using a CTC loss function, to the video. Based on the prediction, the electronic apparatus detects word boundaries in a sequence of characters that corresponds to the lip movements and divides the video into a sequence of video clips based on the detection, such that each video clip corresponds to a word spoken by the human speakers. The electronic apparatus generates a sequence of word predictions by processing the sequence of video clips, and generates a sentence or phrase based on the generated sequence of word predictions.
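The pipeline the abstract describes can be sketched roughly in code: greedy CTC decoding collapses the DNN's per-frame character predictions into a character sequence, and decoded spaces then serve as word boundaries for splitting the frame range into per-word clips. This is an illustrative sketch only, not the patent's actual implementation; the function names, the blank symbol, and the frame bookkeeping are all assumptions.

```python
BLANK = "-"  # assumed CTC blank symbol

def ctc_greedy_decode(frame_chars):
    """Standard CTC greedy decoding: collapse repeats, then drop blanks.

    frame_chars: the most likely character for each video frame (e.g. the
    argmax over the DNN's per-frame output distribution).
    Returns the decoded string and, for each decoded character, the frame
    index that emitted it.
    """
    decoded, frame_of_char = [], []
    prev = None
    for t, c in enumerate(frame_chars):
        if c != prev and c != BLANK:
            decoded.append(c)
            frame_of_char.append(t)
        prev = c
    return "".join(decoded), frame_of_char

def word_clips(decoded, frame_of_char, num_frames):
    """Split the frame range into per-word (start, end) clips at spaces.

    Each clip starts where the previous one ended, so the clips partition
    the full frame range; the last clip runs to the end of the video.
    """
    clips = []
    words = decoded.split(" ")
    char_idx, start = 0, 0
    for w in words:
        end_char = char_idx + len(w) - 1
        end = frame_of_char[end_char]  # frame of the word's last character
        clips.append((start, end + 1))
        char_idx = end_char + 2        # skip over the separating space
        start = clips[-1][1]
    clips[-1] = (clips[-1][0], num_frames)
    return words, clips
```

For example, per-frame predictions `["h","h","-","i","-"," ","t","h","e","r","e","-"]` decode to `"hi there"`, and the 12-frame video is then split into one clip per word, each of which could be fed to a per-word recognition stage as in the abstract.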