# Granite-Docling Resolution Gate (LoRA)

A production-ready autoregressive model that predicts whether a given resolution is sufficient for document-understanding tasks. Built with efficient LoRA fine-tuning on IBM's Granite-Docling foundation model.
## Model Details

### Model Architecture
- Base Model: IBM Granite-Docling-258M
- Approach: Supervised Fine-Tuning (SFT) with LoRA parameter-efficient adapters
- LoRA Configuration: Rank=4, Alpha=16, Dropout=0.05
- Trainable Parameters: 1.4M (0.56% of base model)
- Total Parameters: 259M
- Output Type: Autoregressive text generation
### Key Features

- 🚀 Production-Ready: Self-contained model with LoRA adapters
- 🔍 Interpretable: Direct text output showing reasoning
- 📊 Efficient: Only 0.56% trainable parameters via LoRA
- 🎯 Accurate: Autoregressive token-level learning
- ⚙️ Deployable: Easy integration with standard HF APIs
## Model Card

### Intended Use
This model predicts whether sufficient visual information is present at different resolutions to accurately answer questions about document images. It generates direct text predictions indicating resolution sufficiency.
Primary Use Cases:
- Production document understanding systems
- Multi-resolution processing pipelines
- Document analysis applications
- Enterprise document processing
- Intelligent resolution adaptation
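The resolution-adaptation use case amounts to a simple escalation loop: start low, ask the gate whether the current resolution suffices, and step up only when it does not. A minimal sketch, where `is_sufficient` is a hypothetical stand-in for a call to the gate model:

```python
# Hypothetical resolution-adaptation loop around the gate model.
RESOLUTIONS = ["low", "medium", "high"]

def adapt_resolution(is_sufficient, start="low"):
    """Escalate through resolution levels until the gate reports the
    current one is sufficient (or the highest level is reached).
    `is_sufficient(level)` stands in for an actual model call."""
    idx = RESOLUTIONS.index(start)
    while idx < len(RESOLUTIONS) - 1 and not is_sufficient(RESOLUTIONS[idx]):
        idx += 1
    return RESOLUTIONS[idx]
```

This only calls the gate when escalation is still possible, so the cheapest sufficient resolution is returned.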
### Supported Predictions
The model generates text outputs indicating:
- Whether resolution is sufficient for the query
- Recommended resolution level (low/medium/high)
- Confidence level for the prediction
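Because the model emits free text rather than a class index, downstream code typically parses the generation into structured fields. A minimal sketch, assuming the output mentions sufficiency and a low/medium/high level (the exact output format is an assumption; adjust the patterns to what the model actually emits):

```python
import re

def parse_prediction(text):
    """Parse a generated prediction string into structured fields.
    The keyword conventions here are illustrative assumptions."""
    sufficient = "insufficient" not in text.lower()
    level = re.search(r"\b(low|medium|high)\b", text.lower())
    return {
        "sufficient": sufficient,
        "recommended_resolution": level.group(1) if level else None,
    }
```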
## Training Details

### Dataset
- Name: hardness_data_mix
- Samples: 81,924 document image-question pairs
- Split: 90% train / 10% validation (stratified)
- Labels: 3-class resolution requirements
- Domains: TextVQA, DocVQA, ChartQA, InfographicVQA, HME100K
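A stratified 90/10 split samples the validation fraction from each class separately, so label proportions match across train and validation. A pure-Python sketch of the idea (the actual pipeline's implementation may differ):

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, val_frac=0.1, seed=42):
    """Sketch of a stratified train/validation split: take val_frac of
    each label's indices so class proportions are preserved."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    train, val = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        k = max(1, round(len(idxs) * val_frac))
        val.extend(idxs[:k])
        train.extend(idxs[k:])
    return [samples[i] for i in train], [samples[i] for i in val]
```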
### Training Configuration

- Base Model: ibm-granite/granite-docling-258M
- Batch Size: 32
- Learning Rate: 1e-4
- Optimizer: AdamW with warmup
- Epochs: 6
- LoRA Rank: 4
- LoRA Alpha: 16
- LoRA Dropout: 0.05
- Mixed Precision: bfloat16 (when available)
- Hardware: NVIDIA H100 (80GB)
- Framework: PyTorch + Transformers + TRL + PEFT
### Hyperparameters

```
--model_name: ibm-granite/granite-docling-258M
--bsz: 32
--lr: 1e-4
--epochs: 6
--llm_lora_r: 4
--lora_alpha: 16
--lora_dropout: 0.05
--val_frac: 0.1
--seed: 42
```
## Performance Metrics

Evaluated on the stratified validation set:
- Token-level Accuracy: ~57%
- Mean Token Accuracy: 56.7%
- Evaluation Loss: 2.34
- Entropy: 4.25
- Training Loss (final): 4.06
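Token-level accuracy here is the fraction of non-masked target tokens the model predicts exactly. A minimal sketch of the metric over plain id lists (the training code computes it over tensors, but the logic is the same):

```python
def mean_token_accuracy(pred_ids, label_ids, ignore_index=-100):
    """Fraction of non-ignored label tokens predicted exactly.
    Tokens labeled with ignore_index (e.g. padding) are excluded."""
    correct = total = 0
    for p, y in zip(pred_ids, label_ids):
        if y == ignore_index:
            continue
        total += 1
        correct += int(p == y)
    return correct / total if total else 0.0
```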
## Usage

### Installation

```bash
pip install transformers torch peft
```
### Load Model with LoRA

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel
import torch

# Load the base model
base_model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    trust_remote_code=True,
    device_map="auto",
)

# Load the LoRA adapters on top of it
model = PeftModel.from_pretrained(
    base_model,
    "Kimhi/granite-docling-res-gate-lora",
)

# Load the processor
processor = AutoProcessor.from_pretrained(
    "ibm-granite/granite-docling-258M",
    trust_remote_code=True,
)
```
### Inference

```python
import torch
from PIL import Image

# Prepare inputs
image = Image.open("document.jpg").convert("RGB")
question = "Is the current resolution sufficient to answer this question?"
inputs = processor(images=image, text=question, return_tensors="pt").to(model.device)

# Generate the prediction (greedy decoding)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=15,
        do_sample=False,
    )

# Decode, dropping special tokens
prediction = processor.decode(outputs[0], skip_special_tokens=True)
print(f"Resolution prediction: {prediction}")
```
### Batch Processing

```python
# Process multiple documents in one forward pass
batch_images = [Image.open(f).convert("RGB") for f in image_paths]
batch_questions = ["Is resolution sufficient?" for _ in batch_images]

inputs = processor(
    images=batch_images,
    text=batch_questions,
    return_tensors="pt",
    padding=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=15,
        num_beams=1,  # greedy decoding
    )

predictions = processor.batch_decode(outputs, skip_special_tokens=True)
```
### Merge LoRA Weights (Optional)

```python
# `model` is already a PeftModel, so the adapters can be merged into the
# base weights directly
merged_model = model.merge_and_unload()

# Save the merged model and processor
merged_model.save_pretrained("./merged_model")
processor.save_pretrained("./merged_model")
```
## Limitations
- Primarily trained on document-centric datasets
- Performance depends on document image quality
- May not generalize to very different document types
- Autoregressive generation can be slower than classification
- Best performance on English documents
## Alternative Approach
For a lightweight, fast alternative using frozen features: 👉 SmolVLM Resolution Gate
| Aspect | Granite-Docling | SmolVLM |
|---|---|---|
| Model Size | 258M | 256M |
| Approach | SFT with LoRA | Frozen + classifier |
| Trainable Params | 1.4M | 64K |
| Inference Type | Autoregressive | Classification |
| Inference Speed | Medium ⚡ | Fast ⚡⚡ |
| Output | Direct text | Confidence scores |
| Deployment | Production servers | On-device |
## Technical Details

### LoRA Configuration
- LoRA Rank (r): 4
- LoRA Alpha: 16
- LoRA Dropout: 0.05
- Target Modules: Language model projections
- Initialization: Random Gaussian
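With these values, the adapter configuration would look roughly like the following in PEFT. The `target_modules` list is an assumption: "language model projections" usually maps to the attention projection layers, so check the actual module names in the checkpoint before reusing this.

```python
from peft import LoraConfig

# LoRA configuration matching the values above; target_modules is an
# assumed mapping of "language model projections" to attention layers.
lora_config = LoraConfig(
    r=4,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```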
### Loss Function
- Type: Causal language modeling (autoregressive)
- Masking: Pad tokens and image tokens masked
- Weighting: Stratified sampling by class
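The masking step amounts to replacing pad and image token ids in the labels with the loss-ignore index (-100 in Transformers), so those positions contribute nothing to the causal-LM loss. A sketch with placeholder token ids:

```python
def mask_labels(token_ids, pad_id, image_token_id, ignore_index=-100):
    """Replace pad and image tokens with ignore_index so they are
    excluded from the causal-LM loss. The id arguments are placeholders
    for the processor's actual special-token ids."""
    return [
        ignore_index if t in (pad_id, image_token_id) else t
        for t in token_ids
    ]
```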
## Citation

If you use this model, please cite:

```bibtex
@misc{kimhi2025carescontextawareresolutionselector,
  title={CARES: Context-Aware Resolution Selector for VLMs},
  author={Moshe Kimhi and Nimrod Shabtay and Raja Giryes and Chaim Baskin and Eli Schwartz},
  year={2025},
  eprint={2510.19496},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
}
```
## License
Apache 2.0 - See LICENSE file for details
Base model (IBM Granite) is also Apache 2.0 licensed.
## Acknowledgements
- Built on IBM's Granite-Docling
- Trained using Hugging Face Transformers
- LoRA implementation via PEFT
- Training via TRL
## Model Sources
- Base Model: Granite-Docling-258M
- PEFT Documentation: Parameter-Efficient Fine-Tuning
- Training: SFT Trainer
- Project: CARES GitHub