Consistency Guided Diffusion Model with Neural Syntax
for Perceptual Image Compression

Haowei Kuang 1,2     Yiyang Ma 1    Wenhan Yang 3    Zongming Guo 1,2    Jiaying Liu 1

1 Wangxuan Institute of Computer Technology, Peking University

2 State Key Laboratory of Multimedia Information Processing, Peking University

3 Pengcheng Laboratory

Accepted by ACM MM 2024.

Abstract

Diffusion models show impressive performance in image generation with excellent perceptual quality. However, their tendency to introduce additional distortion prevents their direct application to image compression. To address this issue, this paper introduces a Consistency Guided Diffusion Model (CGDM) tailored for perceptual image compression, which integrates an end-to-end image compression model with a diffusion-based post-processing network, aiming to learn richer detail representations with less fidelity loss. Specifically, the compression and post-processing networks are cascaded, and a branch of consistency guided features is added to constrain the deviation in the diffusion process for better reconstruction quality. Furthermore, a Syntax driven Feature Fusion (SFF) module is constructed to take an extra ultra-low bitstream from the encoding end as input, guiding the adaptive fusion of information from the two branches. In addition, we design a globally uniform boundary control strategy with overlapped patches and adopt a continuous online optimization mode to improve both coding efficiency and global consistency. Extensive experiments validate the superiority of our method over existing perceptual compression techniques and the effectiveness of each component in our method.
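The boundary control mentioned above reconstructs overlapped patches and blends them into a globally consistent image. The paper's exact strategy is not reproduced here; the snippet below is a minimal PyTorch sketch of one standard form of overlapped-patch blending, with the patch size, overlap width, and linear ramp weights chosen purely for illustration.

# Minimal sketch of overlapped-patch blending for globally consistent
# reconstruction. Patch size, stride, and the linear blending weights are
# illustrative assumptions, not the paper's exact boundary-control strategy.
import torch

def blend_patches(patches, coords, out_shape, patch=256, overlap=32):
    """Accumulate overlapping reconstructed patches with a smooth weight mask
    so seams at patch boundaries are averaged away.

    patches:   list of (C, patch, patch) tensors from the restoration model
    coords:    list of (top, left) positions of each patch in the full image
    out_shape: (C, H, W) of the full reconstruction
    """
    canvas = torch.zeros(out_shape)
    weight = torch.zeros(1, *out_shape[1:])

    # 1D ramp that fades in/out over the overlap region, constant in the middle.
    ramp = torch.ones(patch)
    ramp[:overlap] = torch.linspace(0, 1, overlap)
    ramp[-overlap:] = torch.linspace(1, 0, overlap)
    mask = ramp[None, :, None] * ramp[None, None, :]   # (1, patch, patch)

    for p, (top, left) in zip(patches, coords):
        canvas[:, top:top + patch, left:left + patch] += p * mask
        weight[:, top:top + patch, left:left + patch] += mask

    # Normalizing by the accumulated mask averages neighbouring patches in the
    # overlap regions, which suppresses visible seams.
    return canvas / weight.clamp_min(1e-8)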

Method

Figure 1. The entire framework of our proposed method. For an image 𝑥 to be encoded, we first perform lossy compression using a standard end-to-end image compression network, resulting in a degraded output image 𝑥~. Then, we extract a syntax vector from the original image 𝑥 using a syntax generator. This syntax vector is used to guide the fusion of consistency features 𝑒 and diffusion features 𝑑 in the Consistency Guided Diffusion Model with Neural Syntax. After a complete diffusion process, we obtain a higher-quality reconstructed image 𝑥₀. The consistency guidance architecture and neural-syntax-driven mechanism lead the diffusion model to stably reconstruct high-quality images, making the final output excellent in terms of both perception and fidelity.
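As a concrete illustration of how the syntax vector can drive the fusion of the consistency features 𝑒 and the diffusion features 𝑑, the following is a minimal, hypothetical PyTorch sketch of an SFF-style module. The module name, layer sizes, and gating formulation (a sigmoid gate predicted from the syntax vector) are our assumptions for illustration, not the authors' implementation.

# Hypothetical sketch of syntax-driven feature fusion. Dimensions (syntax_dim,
# feat_ch) and the gated-blend formulation are illustrative assumptions.
import torch
import torch.nn as nn

class SFF(nn.Module):
    def __init__(self, syntax_dim=64, feat_ch=128):
        super().__init__()
        # Map the ultra-low-bitrate syntax vector to per-channel fusion weights.
        self.to_gate = nn.Sequential(
            nn.Linear(syntax_dim, feat_ch),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(feat_ch, feat_ch, kernel_size=3, padding=1)

    def forward(self, d, e, syntax):
        """d: diffusion features (B, C, H, W)
        e: consistency-guided features from the decoded image (B, C, H, W)
        syntax: syntax vector extracted at the encoder (B, syntax_dim)
        """
        g = self.to_gate(syntax)[:, :, None, None]   # (B, C, 1, 1)
        fused = g * e + (1.0 - g) * d                # syntax-adaptive blend
        return self.proj(fused)

# Usage: at each denoising step, the fused features replace the plain diffusion
# features inside the denoiser, steering it toward the decoded image content.
sff = SFF()
d = torch.randn(1, 128, 32, 32)
e = torch.randn(1, 128, 32, 32)
syntax = torch.randn(1, 64)
out = sff(d, e, syntax)   # (1, 128, 32, 32)

In this sketch, the syntax vector decides per channel how much the denoiser should rely on the decoded-image features versus its own generative features, which is one simple way to trade fidelity against perceptual detail.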

Results

Figure 2. Tradeoffs between bitrate (x-axes, in bpp) and different metrics (y-axes) for various models tested on Kodak and CLIC. We consider both perceptual metrics (LPIPS, DISTS, FID) and distortion metrics (PSNR, VIF, MS-SSIM). The upper two rows (black frame) show performance on the Kodak dataset and the lower two rows (blue frame) on the CLIC professional dataset.

Figure 3. Visual comparisons with state-of-the-art methods on the Kodak dataset. As can be seen, compared to the baseline used in our method (ILLM), we achieve a significant improvement in subjective quality at the cost of an extremely small additional bitstream.

Figure 4. Visual comparisons with state-of-the-art methods on the CLIC and DIV2K datasets.


Resources

Citation

@inproceedings{kuang2024consistency,
  title={Consistency Guided Diffusion Model with Neural Syntax for Perceptual Image Compression},
  author={Haowei Kuang and Yiyang Ma and Wenhan Yang and Zongming Guo and Jiaying Liu},
  booktitle={ACM Multimedia 2024},
  year={2024},
  url={https://openreview.net/forum?id=nSUMQhITdd}
}