Lightweight image super-resolution based on mixer-based focal modulation network

Research Abstract

Transformer-based models have recently shown strong performance in many natural language processing (NLP) and vision tasks. However, their high computational cost limits practical application. This paper therefore proposes a lightweight model for image super-resolution (SR) called the mixer focal modulation network (MFMN). The MFMN integrates focal modulation with a convolution mixer through a mixer focal modulation module (MFMM). The MFMM is structured like a transformer block but omits the multi-head self-attention (MHSA) module, eliminating its computational overhead. The focal modulation design has three elements: (i) hierarchical contextualization, which uses a stack of depth-wise convolutional layers to encode visual contexts from short to long range; (ii) gated aggregation, which selectively gathers contexts for each query token based on its content; and (iii) element-wise modulation (an affine transformation), which fuses the aggregated context into the query. The MFMM also enables the MFMN to perform both spatial and channel mixing, which improves SR performance. Experiments on multiple benchmarks demonstrate the superior speed of our model against state-of-the-art methods; in particular, it runs around 10x faster than the lightweight Swin Transformer for image restoration (LWSwinIR) at scale x2.
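The three elements of focal modulation described above can be sketched as a single PyTorch module. This is a minimal illustration of the general focal modulation mechanism, not the authors' exact MFMN configuration: the layer names, number of focal levels, and kernel-size schedule are assumptions.

```python
import torch
import torch.nn as nn


class FocalModulation(nn.Module):
    """Minimal sketch of a focal modulation block (assumed configuration)."""

    def __init__(self, dim, focal_levels=3, focal_window=3):
        super().__init__()
        self.focal_levels = focal_levels
        # One linear projection yields the query, the context,
        # and (focal_levels + 1) per-position gates.
        self.proj_in = nn.Linear(dim, 2 * dim + focal_levels + 1)
        # (i) Hierarchical contextualization: a stack of depth-wise convs
        # whose receptive field grows from short to long range.
        self.context_layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=focal_window + 2 * k,
                          padding=(focal_window + 2 * k) // 2, groups=dim),
                nn.GELU(),
            )
            for k in range(focal_levels)
        ])
        self.h = nn.Conv2d(dim, dim, kernel_size=1)  # context -> modulator
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (B, H, W, C)
        c = x.shape[-1]
        q, ctx, gates = torch.split(
            self.proj_in(x), [c, c, self.focal_levels + 1], dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)              # (B, C, H, W) for convs
        gates = gates.permute(0, 3, 1, 2)          # (B, L+1, H, W)
        # (ii) Gated aggregation: each level's context is weighted by a
        # gate computed from the query position's content, then summed.
        ctx_all = 0
        for level, layer in enumerate(self.context_layers):
            ctx = layer(ctx)
            ctx_all = ctx_all + ctx * gates[:, level:level + 1]
        # A final global level pools over all spatial positions.
        ctx_global = ctx.mean(dim=(2, 3), keepdim=True)
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_levels:]
        # (iii) Element-wise modulation: the query is multiplied by the
        # fused context, then projected back to the input dimension.
        out = q * self.h(ctx_all).permute(0, 2, 3, 1)
        return self.proj_out(out)
```

Because the kernel size grows by two per level while padding keeps the spatial size fixed, the block maps an input of shape (B, H, W, C) to an output of the same shape, so it can drop into a transformer-style block in place of MHSA.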

Research Authors
Garas Gendy, Guanghui He & Nabil Sabor
Research Journal
Signal, Image and Video Processing
Research Publisher
Springer
Research Vol
Vol. 19, Article No. 522
Research Year
2025