
Document Reformatting Using Multimodal AI Models for Printer / Scanner Edge Devices
Author(s): Anirban Bhattacharjee
Originally published on Towards AI.
In modern print and scan workflows, document reformatting is a critical component, especially in enterprise environments that deal with diverse input formats, languages, and layouts. Traditional rule-based algorithms often fall short in accurately interpreting and adapting such content. Here we explore how multimodal AI models can perform intelligent document reformatting directly on printer devices. By integrating visual, textual, and layout understanding, multimodal models can transform complex documents into print-ready outputs with high fidelity. We also evaluate how to deploy these models efficiently on the resource-constrained edge hardware typical of printer devices, leveraging model optimization and other techniques to balance performance with computational limitations.
The printing industry is undergoing a transformation with the advent of intelligent edge capabilities. Printers are no longer passive endpoints but active participants in content processing. A key challenge for printers is the ability to reformat documents, often received in a variety of inconsistent and unstructured formats, into clean, well-organized, print-ready versions. This task becomes even more challenging when dealing with multilingual content, scanned documents, and unconventional layouts.
Traditional reformatting methods rely heavily on predefined rules or templates, which fail to scale across diverse document types. Recent advances in AI, particularly multimodal models that combine vision and language understanding, offer a promising solution to this problem.
Problem Statement
Rule-based systems are brittle and require significant manual effort to adapt to new document types. They lack the ability to generalize and often break when encountering unseen layouts or languages. AI-based systems can do better; however, most AI processing has traditionally happened in the cloud. Cloud-based AI processing, while powerful, introduces privacy and latency concerns and depends heavily on the availability of a high-bandwidth network.
To address these challenges, a solution is proposed that embeds multimodal AI capabilities directly within the printer, enabling on-device document reformatting. This approach reduces reliance on external infrastructure and ensures faster, more secure processing.
Multimodal AI for Document Understanding
Visual Language Models (VLMs), a special category of multimodal AI models, integrate textual content, visual layout, and spatial structure to achieve deeper document comprehension.

Different Visual Language Models can be used for different reformatting tasks; a few were evaluated:
• Qwen 2.5 VL: A multimodal model developed by Alibaba Cloud, capable of handling text, images, and other modalities.
• Flux (diffusion models): Flux diffusion models are text-to-image generation models that use a hybrid architecture of multimodal and parallel diffusion transformer blocks. They are known for producing high-quality, detailed images while adhering closely to text prompts.
• LayoutLMv3: Combines text, image, and layout features for document understanding.
• Donut (Document Understanding Transformer): An OCR-free model that processes documents end-to-end using image and sequence modelling.
• Pix2Struct: Converts visual inputs into structured text outputs, useful for layout-to-structure transformation.
• TATR (Table Transformer): An object detection model from Microsoft that recognizes tables in image input.
These models can identify document sections, extract relevant content, and reorganize it into a desired format with minimal supervision. For actual deployment, the Qwen 2.5 VL model, the Flux model, and a few others were selected for document reformatting tasks.

The Qwen 2.5 VL model excels in handling complex visual inputs, including images of varying sizes and extended-duration videos, while also maintaining strong linguistic performance.
Use Cases
Multiple use cases for printers and similar workflows become possible; a few examples:
• Extract tabular data and reformat to graphs: Extract table data from pages, reformat it into plots and graphs, and overwrite the result on the page to be printed.
• Image generation and modification: Generate a greeting card from a prompt, change the color of individual objects, change textual information, change the border, etc.
• Image text correction and text addition: Select the text to be modified and correct it.
• Invoice and Form Reformatting: Automatically restructuring scanned forms into standardized templates.
• Multilingual Content Handling: Supporting translation and reflow of documents in multiple languages.
• Accessibility Optimization: Adapting layout for visually impaired users by increasing font size, raising contrast, and simplifying design.
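As an illustration of the first use case, the sketch below assumes the VLM has been prompted to return a page's table as a JSON list of row objects (a common prompting pattern, not the article's exact format) and converts it into label/value series ready for a plotting library:

```python
import json

def table_to_series(vlm_json: str, label_col: str, value_col: str):
    """Convert a table extracted by the VLM (assumed to arrive as a JSON
    list of row dicts) into (labels, values) ready for plotting."""
    rows = json.loads(vlm_json)
    labels = [row[label_col] for row in rows]
    values = [float(row[value_col]) for row in rows]
    return labels, values

# Hypothetical VLM output for a small sales table found on a scanned page.
extracted = json.dumps([
    {"quarter": "Q1", "sales": "120"},
    {"quarter": "Q2", "sales": "145"},
    {"quarter": "Q3", "sales": "98"},
])

labels, values = table_to_series(extracted, "quarter", "sales")
# labels/values can now be handed to any plotting library (e.g. matplotlib)
# and the rendered chart composited back onto the page before printing.
```

The column names `quarter` and `sales` and the helper `table_to_series` are illustrative only; the real prompt and schema would depend on the deployed model.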
Data Processing Pipeline

This pipeline is executed entirely on-device, ensuring real-time, low-latency, privacy-preserving processing.
• Input Acquisition: The printer captures or receives documents in image or PDF format.
• Preprocessing: Lightweight routines normalize resolution, segment pages, and apply noise reduction.
• Model Inference: A quantized multimodal model interprets content, identifies key elements, and predicts the restructured layout.
• Postprocessing: Generates reflowed text, aligns formatting, and creates a print-ready layout.
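The stages above can be sketched as a minimal on-device pipeline skeleton. Everything here is a stand-in: `preprocess` does a simple box-filter downsample, while `infer` and `postprocess` are placeholders for the quantized model and the layout generator, whose real interfaces the article does not specify.

```python
def preprocess(image, factor=2):
    """Box-filter downsample of a grayscale page (a list of pixel rows)
    by an integer factor -- stands in for the preprocessing step."""
    h, w = len(image), len(image[0])
    return [
        [sum(image[y + dy][x + dx] for dy in range(factor) for dx in range(factor))
         / (factor * factor)
         for x in range(0, w - w % factor, factor)]
        for y in range(0, h - h % factor, factor)
    ]

def infer(image):
    # Placeholder for the quantized multimodal model: here it just
    # reports the page dimensions as a stand-in "layout" prediction.
    return {"height": len(image), "width": len(image[0])}

def postprocess(layout):
    # Placeholder for reflow and formatting of the predicted layout.
    return f"page {layout['width']}x{layout['height']} ready"

def reformat_document(image):
    """Pipeline skeleton: preprocess -> infer -> postprocess."""
    return postprocess(infer(preprocess(image)))

# A tiny synthetic 8x8 grayscale "page".
page = [[float((x + y) % 256) for x in range(8)] for y in range(8)]
print(reformat_document(page))  # -> page 4x4 ready
```

In a real deployment each stage would run against the printer's imaging buffers and the quantized model runtime rather than Python lists.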
Deployment aspects on Resource-Constrained Devices
Edge printers typically operate with limited compute, memory, and storage. The example deployment configuration was an x86-based processor with an NVIDIA Tesla T4 GPU and 16 GB of memory. To support AI workloads under the resource constraints present on edge devices, the following strategies were used:
• Downscaling the image: Reducing image resolution during preprocessing significantly improves processing efficiency by shrinking the amount of data the model must handle.
• Object localization and grounding: Tasks such as accurately locating objects within images using bounding boxes and point coordinates for spatial reasoning could be handled by the base Qwen model's grounding abilities instead of custom pipeline implementations.
• Model Quantization: Reduces model size and accelerates inference with minimal loss in accuracy. For the Qwen 2.5 VL (7B-parameter) model, 4-bit quantization with the GGUF format was used. GGUF was introduced as a more efficient and flexible way of storing and using LLMs for inference, designed to perform well on consumer-grade hardware.
• Diffusion model hyperparameter optimization: Customizing the KSampler and scheduler in the diffusion models gave greater control over the sampling process, optimizing efficiency and improving the quality of generated images. By fine-tuning sampling algorithms and parameters, the model was tailored to specific needs, leading to more precise outputs.
• Edge Runtimes: NVIDIA's optimized TensorRT library was used for deployment on the T4 GPU.
All the above steps helped reduce both inference time and memory footprint for deployment on typical printer SoCs.
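To make the quantization idea concrete, here is a toy sketch of symmetric 4-bit quantization of one weight block. Real GGUF 4-bit schemes are more elaborate (blockwise scales, bit packing), but the memory arithmetic is the same: roughly 4 bits per weight instead of 16 or 32, at the cost of a bounded reconstruction error.

```python
def quantize_4bit(weights):
    """Toy symmetric 4-bit quantization: map each float to an integer in
    [-8, 7] using a single per-block scale. Not the actual GGUF scheme."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid zero scale
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_4bit(q, scale):
    """Recover approximate float weights from 4-bit codes."""
    return [v * scale for v in q]

block = [0.12, -0.07, 0.33, -0.21, 0.05, 0.0, -0.33, 0.18]
q, scale = quantize_4bit(block)
restored = dequantize_4bit(q, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))
# Each code fits in 4 bits; the rounding error stays within scale / 2.
```

This is only a didactic model of the technique; in practice one would use an existing GGUF-quantized checkpoint rather than quantizing weights by hand.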
Challenges and Mitigation Strategies
For real-world deployment on printers, a few challenges need to be addressed:
• Large Document Handling: Use document segmentation and batch processing to manage memory load.
• Inference Accuracy: Regular updates and fine-tuning on use-case-relevant datasets help maintain performance.
• Thermal and Power Constraints: Efficient scheduling and hardware acceleration are required to minimize power consumption.
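The segmentation-and-batching mitigation can be sketched as a simple memory-budgeted splitter; the memory figures and the `batch_pages` helper are illustrative assumptions, not measurements from the deployment:

```python
def batch_pages(pages, max_batch_mem, page_mem):
    """Split a long document into batches that fit a device memory budget.
    max_batch_mem and page_mem are in the same (arbitrary) unit."""
    per_batch = max(1, max_batch_mem // page_mem)  # always make progress
    return [pages[i:i + per_batch] for i in range(0, len(pages), per_batch)]

# e.g. a 10-page scan with roughly three pages' worth of headroom:
batches = batch_pages(list(range(10)), max_batch_mem=300, page_mem=100)
# -> [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

Each batch would then pass through the inference pipeline independently, keeping peak memory bounded regardless of document length.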
Conclusion and Future Directions
Multimodal AI models represent a transformative advancement for document reformatting in printers. By deploying such models directly on-device, manufacturers can offer smarter, more secure, and more adaptable printing solutions. Multiple use cases can be accomplished by utilizing the power of such models.
This approach sets the stage for a new era of intelligent edge printing, where content understanding and reformatting happen seamlessly at the point of output.