Self-Sufficient Framework for Continuous Sign Language Recognition

1KAIST / Daejeon, South Korea. 2Hanyang University / Seoul, South Korea.

Our framework trains sign language recognition models without additional annotations. Trained on weakly-labeled sign language datasets, our model performs better than or comparably to other methods that use multiple modalities or additional knowledge.

Abstract

The goal of this work is to devise a self-sufficient framework for Continuous Sign Language Recognition (CSLR) that addresses key issues of sign language recognition: (1) the need for complex cues such as the hands, face, and mouth for understanding, and (2) the absence of frame-level annotations.

To this end, we propose (1) Divide and Focus Convolution (DFConv) which extracts both manual and non-manual features without additional networks or annotations and (2) Dense Pseudo-Label Refinement (DPLR) which propagates non-spiky frame-level pseudo-labels by combining the ground truth gloss sequence label with the predicted sequence.

We experimentally demonstrate that our model achieves state-of-the-art performance among RGB-based methods on the large-scale CSLR benchmarks PHOENIX-2014 and PHOENIX-2014-T, while showing comparable results with better efficiency than approaches that use multiple modalities or extra annotations.

Additional Experiments

We present additional experimental results to support the novelty of our framework.

Robustness Comparison

We demonstrate our framework's robustness in a real-world scenario by changing the scale and translation of the input at inference time. Furthermore, we show failure cases of the pose detector in STMC [1] when the same transformations (scale, translation) are applied. Note that our framework requires only the RGB modality.
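The sketch below shows one way such an inference-time perturbation could be applied, assuming PyTorch/torchvision clips of shape (T, C, H, W); the concrete scale and shift values are illustrative and not the ones used in our evaluation.

```python
import torch
import torchvision.transforms.functional as TF

def perturb_clip(frames: torch.Tensor, scale: float = 0.8,
                 dx: int = 20, dy: int = 0) -> torch.Tensor:
    """Apply a fixed scale / translation to every frame of a clip so that
    robustness can be probed at inference time without retraining.
    `frames` has shape (T, C, H, W); scale/shift values are illustrative."""
    return torch.stack([
        TF.affine(frame, angle=0.0, translate=[dx, dy], scale=scale, shear=[0.0])
        for frame in frames
    ])
```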

Efficiency Analysis

We compare the computational complexity of our framework with that of the most recent multi-cue method, STMC. Even though the pose detector of STMC is lightweight, it still creates a bottleneck at inference time. We highlight that DFConv significantly reduces both FLOPs and inference time by removing the pose estimator. For reference, in our environment, extracting human keypoints with HRNet [2] from the PHOENIX-2014 dataset [3] takes 2-3 GPU days. Note that we re-implement STMC because its code is not publicly available.

Comparison of computational cost and inference time with STMC. (*): results taken directly from the original STMC paper.
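As a reference point for how such timings can be collected, the sketch below measures average per-clip latency in PyTorch; the warm-up and repetition counts, as well as the dummy input shape, are arbitrary assumptions rather than our exact measurement protocol.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model: torch.nn.Module, clip: torch.Tensor,
                    warmup: int = 10, runs: int = 50) -> float:
    """Rough per-clip latency in seconds. `clip` is a dummy input,
    e.g. of shape (1, T, 3, H, W); counts are arbitrary choices."""
    model.eval()
    for _ in range(warmup):          # warm up kernels / caches
        model(clip)
    if clip.is_cuda:
        torch.cuda.synchronize()     # make sure GPU work has finished
    start = time.perf_counter()
    for _ in range(runs):
        model(clip)
    if clip.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs
```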

Generality of DPLR

To demonstrate the wide applicability of DPLR, we compare DPLR with other CSLR approaches using pseudo-labeling.

We implement FCN [4] and VAC [5] and replace their pseudo-labeling modules, GFE and VA respectively, with our DPLR module. For both methods, DPLR outperforms the original pseudo-labeling modules. Moreover, we highlight that DPLR further boosts the full version of VAC, achieving a WER of 21.6% on the Test split. This indicates that DPLR is complementary to the VA module in VAC.

Additional Qualitative Results

We visualize more qualitative examples.

Divide and Focus Convolution

Comparison of Grad-CAM [6] activation maps between our backbone and VGG-11 [7]. DFConv highlights multiple individual elements (hands, face) across the entire image, whereas VGG-11 attends only to the hands.
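For reference, the following is a minimal sketch of the Grad-CAM [6] procedure used to produce such maps, assuming a PyTorch model that returns (1, num_classes) logits for a single frame and a chosen convolutional target layer; both the model interface and the layer choice are assumptions, not fixed by our framework.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, frame, target_layer, class_idx):
    """Minimal Grad-CAM sketch: hook a convolutional layer, weight its
    activations by the spatially pooled gradients of the target score,
    and up-sample the resulting map to the input resolution."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

    score = model(frame)[0, class_idx]   # assumes (1, num_classes) logits
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    weights = grads["a"].mean(dim=(2, 3), keepdim=True)            # pooled gradients
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))  # weighted activations
    cam = F.interpolate(cam, size=frame.shape[-2:], mode="bilinear",
                        align_corners=False)
    return cam / (cam.max() + 1e-8)                                # normalize to [0, 1]
```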

Gloss-level Sequence Prediction

To illustrate the effectiveness of DPLR, we show more qualitative results of gloss-level sequence predictions. Note that no extra network is required to generate the Dense Pseudo-Labels (DPL), and since the classifier for DPLR is auxiliary, it does not affect inference time, which is an important factor for real-time operation.

Qualitative results of the predicted gloss sequences on the PHOENIX-2014 Test split. Each colored block represents a different gloss, and the horizontal axis is the time axis.
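To make the idea of dense pseudo-labels concrete, the sketch below illustrates one simple densification strategy in the spirit of DPLR: spiky frame-level predictions are filtered against the ground-truth gloss sequence and then spread to neighbouring blank frames. This is an illustrative reading, not the exact DPLR algorithm; the blank index and the nearest-peak assignment are assumptions.

```python
from typing import List

BLANK = 0  # CTC blank index (assumption)

def densify_pseudo_labels(frame_preds: List[int], gt_glosses: List[int]) -> List[int]:
    """Illustrative densification only: keep spiky non-blank predictions that
    agree with the ground-truth gloss sequence of the sentence, then assign
    every frame the gloss of the closest retained peak."""
    peaks = [(t, g) for t, g in enumerate(frame_preds)
             if g != BLANK and g in set(gt_glosses)]
    if not peaks:
        return [BLANK] * len(frame_preds)

    dense = []
    for t in range(len(frame_preds)):
        nearest = min(peaks, key=lambda p: abs(p[0] - t))
        dense.append(nearest[1])
    return dense
```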

Additional Discussion

Potential Societal Impact

Our proposed solution directly contributes to the development of sign translation systems by providing high-quality, multi-cue-aware visual features to modern sign translation models. Advanced sign interpretation technologies could help socially marginalized deaf people and improve accessibility to social infrastructure such as education and healthcare, which hearing people take for granted. However, the currently available large-scale PHOENIX/T benchmarks are sourced from a specific domain (e.g., weather forecasts), which could bias the model towards a particular scenario in both language and visual appearance, leading to potential miscommunication that could affect the lives of deaf people.

Limitation and Future Work

Although we have shown that both non-manual and manual expressions can be captured simultaneously from a sign video, a limitation of our work stems from the assumption that non-manual expressions occur in the upper region of a frame and manual expressions in the lower region. While we address this issue by making the division ratio r adaptable at test time, practical scenarios may introduce not only positional shifts but also variations in the signer's scale (e.g., due to distance from the camera). Future CSLR work should embrace such practical challenges so that recognition systems can be deployed in the real world with ease.
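To make the upper/lower assumption and the role of the division ratio r concrete, the following is a minimal sketch of the dividing idea behind DFConv, assuming a hard split along the height of the feature map; the channel sizes, kernel sizes, and fusion by concatenation are illustrative assumptions rather than our exact architecture.

```python
import torch
import torch.nn as nn

class DFConvSketch(nn.Module):
    """Sketch of the divide-and-focus idea: split the feature map along the
    height at ratio r, run separate convolutions on the upper (non-manual:
    face/mouth) and lower (manual: hands) regions, then stitch them back."""

    def __init__(self, in_ch: int, out_ch: int, r: float = 0.4):
        super().__init__()
        self.r = r  # division ratio: fraction of height treated as the upper region
        self.upper_conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.lower_conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map of a video frame
        split = int(x.shape[2] * self.r)
        upper = self.upper_conv(x[:, :, :split, :])   # focus on face / mouth
        lower = self.lower_conv(x[:, :, split:, :])   # focus on hands
        return torch.cat([upper, lower], dim=2)       # re-assemble along height
```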

References

[1] Zhou, Hao, et al. "Spatial-temporal multi-cue network for continuous sign language recognition." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 07. 2020.

[2] Sun, Ke, et al. "Deep high-resolution representation learning for human pose estimation." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.

[3] Koller, Oscar, Jens Forster, and Hermann Ney. "Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers." Computer Vision and Image Understanding 141 (2015): 108-125.

[4] Cheng, Ka Leong, et al. "Fully convolutional networks for continuous sign language recognition." European Conference on Computer Vision. Springer, Cham, 2020.

[5] Min, Yuecong, et al. "Visual alignment constraint for continuous sign language recognition." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

[6] Selvaraju, Ramprasaath R., et al. "Grad-cam: Visual explanations from deep networks via gradient-based localization." Proceedings of the IEEE international conference on computer vision. 2017.

[7] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).