---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- ocr
- vision-language-model
- document-understanding
---

# OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

OCRVerse is a holistic OCR model that unifies text-centric OCR (extracting text from documents such as books and magazines) and vision-centric OCR (identifying visual elements in information-dense sources such as charts, web pages, and scientific plots) in a single end-to-end vision-language model.

- **Paper:** [OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models](https://huggingface.co/papers/2601.21639)
- **GitHub Repository:** [DocTron-hub/OCRVerse](https://github.com/DocTron-hub/OCRVerse)

## Usage Example

To use OCRVerse, make sure the `transformers` library is installed, along with `accelerate`, which is required when loading the model with `device_map`:

```shell
pip install "transformers>=4.57.0" accelerate
```

### Text-Centric Document Parsing

Below is a simple example of how to use OCRVerse for document parsing tasks.

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Load the model and processor
model_path = 'DocTron/OCRVerse'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare the input with an image and a text prompt
image_path = "path/to/your/image.jpg"
# We recommend the following prompt for better performance
prompt = "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]

# Tokenize the chat template and move the tensors to the model's device
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate the output
generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)

# Strip the prompt tokens and decode only the newly generated text
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
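
### Vision-Centric Parsing

OCRVerse also targets vision-centric inputs such as charts, web pages, and scientific plots. A minimal sketch follows, reusing the `model` and `processor` loaded above and changing only the image and prompt; the chart-parsing prompt and image path here are illustrative assumptions rather than officially recommended settings, so check the GitHub repository for the prompts used in the paper.

```python
# Reuse the `model` and `processor` loaded in the example above;
# only the image and the prompt change for vision-centric inputs.
image_path = "path/to/your/chart.png"
# Hypothetical chart-parsing prompt; see the GitHub repository for official prompts.
prompt = "Parse the chart in the image and describe its visual elements and data."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]

# Same inference pipeline as for text-centric document parsing
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0])
```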

## Citation

If you find this project useful, please cite our paper:

```bibtex
@article{zhong2026ocrverse,
  title={OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models},
  author={Yufeng Zhong and Lei Chen and Xuanle Zhao and Wenkang Han and Liming Zheng and Jing Huang and Deyang Jiang and Yilin Cao and Lin Ma and Zhixiong Zeng},
  journal={arXiv preprint arXiv:2601.21639},
  year={2026}
}
```