# Granite-Docling Resolution Gate (LoRA)

A production-ready autoregressive model that predicts whether a given resolution is sufficient for document-understanding tasks. Built with efficient LoRA fine-tuning on IBM's Granite-Docling foundation model.
## Model Details

### Model Architecture
- Base Model: IBM Granite-Docling-258M
- Approach: Supervised Fine-Tuning (SFT) with LoRA parameter-efficient adapters
- LoRA Configuration: Rank=4, Alpha=16, Dropout=0.05
- Trainable Parameters: 1.4M (0.56% of base model)
- Total Parameters: 259M
- Output Type: Autoregressive text generation
### Key Features

- 🚀 Production-Ready: Self-contained model with LoRA adapters
- 🔍 Interpretable: Direct text output showing reasoning
- 📊 Efficient: Only 0.56% trainable parameters via LoRA
- 🎯 Accurate: Autoregressive token-level learning
- ⚙️ Deployable: Easy integration with standard HF APIs
## Model Card

### Intended Use
This model predicts whether sufficient visual information is present at different resolutions to accurately answer questions about document images. It generates direct text predictions indicating resolution sufficiency.
Primary Use Cases:
- Production document understanding systems
- Multi-resolution processing pipelines
- Document analysis applications
- Enterprise document processing
- Intelligent resolution adaptation
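The resolution-adaptation use case amounts to a simple escalation loop: start low, ask the gate whether the current resolution suffices, and step up only when it does not. A minimal sketch, where `is_sufficient` is a hypothetical stand-in for a call to the gate model:

```python
# Hypothetical resolution-adaptation loop around the gate model.
RESOLUTIONS = ["low", "medium", "high"]

def adapt_resolution(is_sufficient, start="low"):
    """Escalate through resolution levels until the gate reports the
    current one is sufficient (or the highest level is reached).
    `is_sufficient(level)` stands in for an actual model call."""
    idx = RESOLUTIONS.index(start)
    while idx < len(RESOLUTIONS) - 1 and not is_sufficient(RESOLUTIONS[idx]):
        idx += 1
    return RESOLUTIONS[idx]
```

This only calls the gate when escalation is still possible, so the cheapest sufficient resolution is returned.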
### Supported Predictions
The model generates text outputs indicating:
- Whether resolution is sufficient for the query
- Recommended resolution level (low/medium/high)
- Confidence level for the prediction
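Because the model emits free text rather than a class index, downstream code typically parses the generation into structured fields. A minimal sketch, assuming the output mentions sufficiency and a low/medium/high level (the exact output format is an assumption; adjust the patterns to what the model actually emits):

```python
import re

def parse_prediction(text):
    """Parse a generated prediction string into structured fields.
    The keyword conventions here are illustrative assumptions."""
    sufficient = "insufficient" not in text.lower()
    level = re.search(r"\b(low|medium|high)\b", text.lower())
    return {
        "sufficient": sufficient,
        "recommended_resolution": level.group(1) if level else None,
    }
```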
## Training Details

### Dataset
- Name: hardness_data_mix
- Samples: 81,924 document image-question pairs
- Split: 90% train / 10% validation (stratified)
- Labels: 3-class resolution requirements
- Domains: TextVQA, DocVQA, ChartQA, InfographicVQA, HME100K
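A stratified 90/10 split samples the validation fraction from each class separately, so label proportions match across train and validation. A pure-Python sketch of the idea (the actual pipeline's implementation may differ):

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, val_frac=0.1, seed=42):
    """Sketch of a stratified train/validation split: take val_frac of
    each label's indices so class proportions are preserved."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    train, val = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        k = max(1, round(len(idxs) * val_frac))
        val.extend(idxs[:k])
        train.extend(idxs[k:])
    return [samples[i] for i in train], [samples[i] for i in val]
```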
### Training Configuration

- Base Model: ibm-granite/granite-docling-258M
- Batch Size: 32
- Learning Rate: 1e-4
- Optimizer: AdamW with warmup
- Epochs: 6
- LoRA Rank: 4
- LoRA Alpha: 16
- LoRA Dropout: 0.05
- Mixed Precision: bfloat16 (when available)
- Hardware: NVIDIA H100 (80GB)
- Framework: PyTorch + Transformers + TRL + PEFT
### Hyperparameters

```
--model_name: ibm-granite/granite-docling-258M
--bsz: 32
--lr: 1e-4
--epochs: 6
--llm_lora_r: 4
--lora_alpha: 16
--lora_dropout: 0.05
--val_frac: 0.1
--seed: 42
```
## Performance Metrics

Evaluated on the stratified validation set:
- Token-level Accuracy: ~57%
- Mean Token Accuracy: 56.7%
- Evaluation Loss: 2.34
- Entropy: 4.25
- Training Loss (final): 4.06
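Token-level accuracy here is the fraction of non-masked target tokens the model predicts exactly. A minimal sketch of the metric over plain id lists (the training code computes it over tensors, but the logic is the same):

```python
def mean_token_accuracy(pred_ids, label_ids, ignore_index=-100):
    """Fraction of non-ignored label tokens predicted exactly.
    Tokens labeled with ignore_index (e.g. padding) are excluded."""
    correct = total = 0
    for p, y in zip(pred_ids, label_ids):
        if y == ignore_index:
            continue
        total += 1
        correct += int(p == y)
    return correct / total if total else 0.0
```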
## Usage

### Installation

```bash
pip install transformers torch peft
```
### Load Model with LoRA

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel
import torch

# Load the base model
base_model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    trust_remote_code=True,
    device_map="auto",
)

# Load the LoRA adapters on top of it
model = PeftModel.from_pretrained(
    base_model,
    "Kimhi/granite-docling-res-gate-lora",
)

# Load the processor
processor = AutoProcessor.from_pretrained(
    "ibm-granite/granite-docling-258M",
    trust_remote_code=True,
)
```
### Inference

```python
import torch
from PIL import Image

# Prepare inputs
image = Image.open("document.jpg").convert("RGB")
question = "Is the current resolution sufficient to answer this question?"
inputs = processor(images=image, text=question, return_tensors="pt").to(model.device)

# Generate the prediction (greedy decoding)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=15,
        do_sample=False,
    )

# Decode, dropping special tokens
prediction = processor.decode(outputs[0], skip_special_tokens=True)
print(f"Resolution prediction: {prediction}")
```
### Batch Processing

```python
# Process multiple documents in one forward pass
batch_images = [Image.open(f).convert("RGB") for f in image_paths]
batch_questions = ["Is resolution sufficient?" for _ in batch_images]

inputs = processor(
    images=batch_images,
    text=batch_questions,
    return_tensors="pt",
    padding=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=15,
        num_beams=1,  # greedy decoding
    )

predictions = processor.batch_decode(outputs, skip_special_tokens=True)
```
### Merge LoRA Weights (Optional)

```python
# `model` is already a PeftModel, so the adapters can be merged into the
# base weights directly
merged_model = model.merge_and_unload()

# Save the merged model and processor
merged_model.save_pretrained("./merged_model")
processor.save_pretrained("./merged_model")
```
## Limitations
- Primarily trained on document-centric datasets
- Performance depends on document image quality
- May not generalize to very different document types
- Autoregressive generation can be slower than classification
- Best performance on English documents
## Alternative Approach
For a lightweight, fast alternative using frozen features: 👉 SmolVLM Resolution Gate
| Aspect | Granite-Docling | SmolVLM |
|---|---|---|
| Model Size | 258M | 256M |
| Approach | SFT with LoRA | Frozen + classifier |
| Trainable Params | 1.4M | 64K |
| Inference Type | Autoregressive | Classification |
| Inference Speed | Medium ⚡ | Fast ⚡⚡ |
| Output | Direct text | Confidence scores |
| Deployment | Production servers | On-device |
## Technical Details

### LoRA Configuration
- LoRA Rank (r): 4
- LoRA Alpha: 16
- LoRA Dropout: 0.05
- Target Modules: Language model projections
- Initialization: Random Gaussian
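With these values, the adapter configuration would look roughly like the following in PEFT. The `target_modules` list is an assumption: "language model projections" usually maps to the attention projection layers, so check the actual module names in the checkpoint before reusing this.

```python
from peft import LoraConfig

# LoRA configuration matching the values above; target_modules is an
# assumed mapping of "language model projections" to attention layers.
lora_config = LoraConfig(
    r=4,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```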
### Loss Function
- Type: Causal language modeling (autoregressive)
- Masking: Pad tokens and image tokens masked
- Weighting: Stratified sampling by class
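The masking step amounts to replacing pad and image token ids in the labels with the loss-ignore index (-100 in Transformers), so those positions contribute nothing to the causal-LM loss. A sketch with placeholder token ids:

```python
def mask_labels(token_ids, pad_id, image_token_id, ignore_index=-100):
    """Replace pad and image tokens with ignore_index so they are
    excluded from the causal-LM loss. The id arguments are placeholders
    for the processor's actual special-token ids."""
    return [
        ignore_index if t in (pad_id, image_token_id) else t
        for t in token_ids
    ]
```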
## Citation

If you use this model, please cite:

```bibtex
@misc{kimhi2025carescontextawareresolutionselector,
  title={CARES: Context-Aware Resolution Selector for VLMs},
  author={Moshe Kimhi and Nimrod Shabtay and Raja Giryes and Chaim Baskin and Eli Schwartz},
  year={2025},
  eprint={2510.19496},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
}
```
## License
Apache 2.0 - See LICENSE file for details
Base model (IBM Granite) is also Apache 2.0 licensed.
## Acknowledgements
- Built on IBM's Granite-Docling
- Trained using Hugging Face Transformers
- LoRA implementation via PEFT
- Training via TRL
## Model Sources
- Base Model: Granite-Docling-258M
- PEFT Documentation: Parameter-Efficient Fine-Tuning
- Training: SFT Trainer
- Project: CARES GitHub