Granite-Docling Resolution Gate (LoRA)

A production-ready autoregressive model that predicts whether a document image's resolution is sufficient for a given question. Built by efficient LoRA fine-tuning of IBM's Granite-Docling foundation model.

Model Details

Model Architecture

  • Base Model: IBM Granite-Docling-258M
  • Approach: Supervised Fine-Tuning (SFT) with LoRA parameter-efficient adapters
  • LoRA Configuration: Rank=4, Alpha=16, Dropout=0.05
  • Trainable Parameters: 1.4M (0.56% of base model)
  • Total Parameters: 259M
  • Output Type: Autoregressive text generation

Key Features

  • 🚀 Production-Ready: self-contained model with LoRA adapters
  • 🔍 Interpretable: direct text output showing its reasoning
  • 📊 Efficient: only 0.56% of parameters trainable via LoRA
  • 🎯 Accurate: autoregressive token-level learning
  • ⚙️ Deployable: easy integration with standard Hugging Face APIs

Model Card

Intended Use

This model predicts whether sufficient visual information is present at different resolutions to accurately answer questions about document images. It generates direct text predictions indicating resolution sufficiency.

Primary Use Cases:

  • Production document understanding systems
  • Multi-resolution processing pipelines
  • Document analysis applications
  • Enterprise document processing
  • Intelligent resolution adaptation

Supported Predictions

The model generates text outputs indicating:

  • Whether resolution is sufficient for the query
  • Recommended resolution level (low/medium/high)
  • Confidence level for the prediction

Training Details

Dataset

  • Name: hardness_data_mix
  • Samples: 81,924 document image-question pairs
  • Split: 90% train / 10% validation (stratified)
  • Labels: 3-class resolution requirements
  • Domains: TextVQA, DocVQA, ChartQA, InfographicVQA, HME100K
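The 90/10 stratified split keeps each resolution class at the same proportion in train and validation. A plain-Python sketch of the idea (a simplified stand-in; the actual data pipeline is not shown in this card):

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.1, seed=42):
    """Split sample indices so every class contributes val_frac of its
    members to validation (mirrors --val_frac 0.1 and --seed 42)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train, val = [], []
    for cls_indices in by_class.values():
        rng.shuffle(cls_indices)
        n_val = max(1, int(len(cls_indices) * val_frac))
        val.extend(cls_indices[:n_val])
        train.extend(cls_indices[n_val:])
    return sorted(train), sorted(val)

# Toy 3-class label list for illustration only
labels = ["low"] * 50 + ["medium"] * 30 + ["high"] * 20
train_idx, val_idx = stratified_split(labels)
```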

Training Configuration

Base Model: ibm-granite/granite-docling-258M
Batch Size: 32
Learning Rate: 1e-4
Optimizer: AdamW with warmup
Epochs: 6
LoRA Rank: 4
LoRA Alpha: 16
LoRA Dropout: 0.05
Mixed Precision: bfloat16 (when available)
Hardware: NVIDIA H100 (80GB)
Framework: PyTorch + Transformers + TRL + PEFT

Hyperparameters

  • --model_name: ibm-granite/granite-docling-258M
  • --bsz: 32
  • --lr: 1e-4
  • --epochs: 6
  • --llm_lora_r: 4
  • --lora_alpha: 16
  • --lora_dropout: 0.05
  • --val_frac: 0.1
  • --seed: 42

Performance Metrics

Evaluated on stratified validation set:

  • Token-level Accuracy: ~57%
  • Mean Token Accuracy: 56.7%
  • Evaluation Loss: 2.34
  • Entropy: 4.25
  • Training Loss (final): 4.06
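Mean token accuracy here is the fraction of non-masked label tokens predicted exactly. A minimal sketch of the metric (the actual evaluation code is not part of this card; -100 as the ignore index follows the usual Transformers convention):

```python
def mean_token_accuracy(pred_ids, label_ids, ignore_index=-100):
    """Fraction of label tokens the model got exactly right,
    skipping positions masked out with ignore_index."""
    correct = total = 0
    for p, y in zip(pred_ids, label_ids):
        if y == ignore_index:
            continue  # masked position: contributes neither way
        total += 1
        correct += int(p == y)
    return correct / max(total, 1)

# Toy example: 2 of 3 unmasked tokens match
acc = mean_token_accuracy([5, 7, 9, 2], [5, -100, 9, 3])
```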

Usage

Installation

pip install transformers torch peft pillow

Load Model with LoRA

from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    trust_remote_code=True,
    device_map="auto"
)

# Load LoRA adapters
model = PeftModel.from_pretrained(
    base_model,
    "Kimhi/granite-docling-res-gate-lora"
)

# Load processor
processor = AutoProcessor.from_pretrained(
    "ibm-granite/granite-docling-258M",
    trust_remote_code=True
)

Inference

from PIL import Image

# Prepare inputs
image = Image.open("document.jpg").convert("RGB")
question = "Is the current resolution sufficient to answer this question?"

inputs = processor(images=image, text=question, return_tensors="pt").to(model.device)

# Generate prediction
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=15,
        do_sample=False
    )

# Decode
prediction = processor.decode(outputs[0], skip_special_tokens=True)
print(f"Resolution prediction: {prediction}")

Batch Processing

# Process multiple documents
image_paths = ["doc1.jpg", "doc2.jpg"]  # paths to your document images
batch_images = [Image.open(f).convert("RGB") for f in image_paths]
batch_questions = ["Is resolution sufficient?" for _ in batch_images]

inputs = processor(
    images=batch_images,
    text=batch_questions,
    return_tensors="pt",
    padding=True
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=15,
        num_beams=1
    )

predictions = processor.batch_decode(outputs, skip_special_tokens=True)

Merge LoRA Weights (Optional)

# Merge LoRA adapters into the base model
# (model is already a PeftModel, so merge_and_unload can be called directly)
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./merged_model")
processor.save_pretrained("./merged_model")

Limitations

  • Primarily trained on document-centric datasets
  • Performance depends on document image quality
  • May not generalize to very different document types
  • Autoregressive generation can be slower than classification
  • Best performance on English documents

Alternative Approach

For a lightweight, fast alternative using frozen features: 👉 SmolVLM Resolution Gate

| Aspect | Granite-Docling | SmolVLM |
|---|---|---|
| Model Size | 258M | 256M |
| Approach | SFT with LoRA | Frozen + classifier |
| Trainable Params | 1.4M | 64K |
| Inference Type | Autoregressive | Classification |
| Inference Speed | Medium ⚡⚡ | Fast ⚡ |
| Output | Direct text | Confidence scores |
| Deployment | Production servers | On-device |

Technical Details

LoRA Configuration

  • LoRA Rank (r): 4
  • LoRA Alpha: 16
  • LoRA Dropout: 0.05
  • Target Modules: Language model projections
  • Initialization: Random Gaussian
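With rank r, LoRA adds two low-rank matrices A (r × d_in) and B (d_out × r) per target projection, so the extra trainable parameter count is r·(d_in + d_out). A quick check with a hypothetical 576×576 projection (dimensions chosen for illustration only, not taken from the model config):

```python
def lora_param_count(d_in, d_out, r=4):
    """Extra trainable parameters a rank-r LoRA adapter adds to a
    d_in x d_out projection: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

added = lora_param_count(576, 576)  # 4 * (576 + 576) = 4608
base = 576 * 576                    # 331,776 full-rank weights
pct_of_layer = 100 * added / base
```

Summed over all targeted projections, this is how a rank-4 adapter stays at roughly 1.4M trainable parameters on a 258M base.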

Loss Function

  • Type: Causal language modeling (autoregressive)
  • Masking: Pad tokens and image tokens masked
  • Weighting: Stratified sampling by class
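The masking rule can be sketched as follows: labels copy the input ids, but pad and image-placeholder tokens are replaced with -100 (the standard Transformers ignore index) so they contribute no loss. Token ids below are illustrative, not the model's actual vocabulary:

```python
def mask_labels(input_ids, pad_id, image_id, ignore_index=-100):
    """Build causal-LM labels from input ids, zeroing out the loss
    on pad and image-placeholder positions via the ignore index."""
    return [ignore_index if t in (pad_id, image_id) else t
            for t in input_ids]

# Hypothetical ids: 0 = pad, 32000 = image placeholder
labels = mask_labels([101, 32000, 7, 8, 0, 0], pad_id=0, image_id=32000)
```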

Citation

If you use this model, please cite:

@misc{kimhi2025carescontextawareresolutionselector,
      title={CARES: Context-Aware Resolution Selector for VLMs}, 
      author={Moshe Kimhi and Nimrod Shabtay and Raja Giryes and Chaim Baskin and Eli Schwartz},
      year={2025},
      eprint={2510.19496},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}

License

Apache 2.0 - See LICENSE file for details

Base model (IBM Granite) is also Apache 2.0 licensed.

