Super-resolution (SR) for scene text images is a pre-processing step for scene text recognition that improves recognition accuracy. The task aims to improve the visual quality of text regions reconstructed from low-resolution images. Although SR techniques have improved significantly with the recent development of deep learning, it remains challenging to reconstruct high-resolution images from images captured in the wild that contain irregularly shaped text, severe noise, and blurring. This is because CNN-based methods rely on local computations, do not exploit text-specific characteristics, and are therefore unable to deal with irregular deformations. In this paper, we propose a multi-task learning-based Attentional Feature Fusion Network (MAFF-Net) to reconstruct visually high-quality images from low-resolution images in real scenes. MAFF-Net consists of a reconstruction branch and a super-resolution branch, which are trained simultaneously so that complementary features of the reconstruction model, such as noise reduction and structural information of the text, are shared through the feature representation transfer (FRT) module. In addition, a transformer module equipped with a 2-D self-attention mechanism is used to handle irregular deformations of the text. We then improve the visual quality of images with severe noise, blurring, and irregular deformations by fusing the attentional features obtained from the different viewpoints of the FRT module and the transformer module. Experimental results on the TextZoom benchmark dataset show that the proposed method achieves performance competitive with state-of-the-art methods and demonstrates its effectiveness, especially on challenging images.
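
The abstract describes a two-branch layout (a reconstruction branch and a super-resolution branch) whose features are exchanged through an FRT module and a transformer with 2-D self-attention before fusion. The following is a minimal sketch of that structure only; the layer sizes, the 1x1-convolution FRT, the multi-head attention internals, and the concatenation-based fusion are assumptions for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn


class FRT(nn.Module):
    """Hypothetical feature representation transfer: projects reconstruction-branch
    features so they can be injected into the super-resolution branch."""
    def __init__(self, channels=64):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, recon_feat):
        return self.proj(recon_feat)


class SelfAttention2D(nn.Module):
    """Hypothetical 2-D self-attention over the spatial positions of a feature map."""
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, feat):                      # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.transpose(1, 2).reshape(b, c, h, w)


class MAFFNetSketch(nn.Module):
    def __init__(self, channels=64, scale=2):
        super().__init__()
        # Reconstruction branch: learns a clean LR-size image (noise reduction,
        # text structure), providing complementary features for the SR branch.
        self.recon_encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.recon_head = nn.Conv2d(channels, 3, 3, padding=1)
        # Super-resolution branch.
        self.sr_encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.frt = FRT(channels)                  # transfers reconstruction features
        self.attn = SelfAttention2D(channels)     # 2-D self-attention on SR features
        # Attentional feature fusion (assumed: concatenation + 1x1 convolution).
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.sr_head = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, lr):
        recon_feat = self.recon_encoder(lr)
        recon = self.recon_head(recon_feat)       # reconstruction-branch output
        sr_feat = self.sr_encoder(lr)
        # Fuse FRT-transferred reconstruction features with attention-refined
        # SR features, then upsample.
        fused = self.fuse(torch.cat([self.frt(recon_feat), self.attn(sr_feat)], dim=1))
        sr = self.sr_head(fused)                  # super-resolution output
        return sr, recon                          # both branches supervised jointly


lr = torch.rand(1, 3, 16, 64)                     # a small low-resolution text crop
sr, recon = MAFFNetSketch()(lr)
print(sr.shape, recon.shape)                      # (1, 3, 32, 128), (1, 3, 16, 64)
```

Returning both outputs lets a training loop apply a super-resolution loss and a reconstruction loss at the same time, which is how the two branches could be trained simultaneously as the abstract describes.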