Instructions to use NousResearch/OLMo-Bitnet-1B with libraries, inference providers, notebooks, and local apps.
- Libraries
- Transformers
How to use NousResearch/OLMo-Bitnet-1B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="NousResearch/OLMo-Bitnet-1B", trust_remote_code=True)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("NousResearch/OLMo-Bitnet-1B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("NousResearch/OLMo-Bitnet-1B", trust_remote_code=True)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use NousResearch/OLMo-Bitnet-1B with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "NousResearch/OLMo-Bitnet-1B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "NousResearch/OLMo-Bitnet-1B",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
Use Docker
docker model run hf.co/NousResearch/OLMo-Bitnet-1B
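The same completion request can also be sent from Python. Below is a minimal sketch that uses only the standard library. The helper names `build_completion_request` and `complete` are hypothetical. The sketch assumes two things: the vLLM server started above is listening on `localhost:8000`, and it returns the OpenAI-style `choices[0].text` field. The SGLang server below exposes the same endpoint on port 30000.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, **kwargs):
    """Send the request and return the text of the first choice.

    Assumes the response follows the OpenAI completions schema.
    """
    req = build_completion_request(base_url, model, prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Usage (requires a running server):
# complete("http://localhost:8000", "NousResearch/OLMo-Bitnet-1B", "Once upon a time,")
```

Building the request separately from sending it lets you inspect or reuse the payload, for example to point it at the SGLang port instead.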
- SGLang
How to use NousResearch/OLMo-Bitnet-1B with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "NousResearch/OLMo-Bitnet-1B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "NousResearch/OLMo-Bitnet-1B",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
Use Docker images
```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "NousResearch/OLMo-Bitnet-1B" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "NousResearch/OLMo-Bitnet-1B",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
```
- Docker Model Runner
How to use NousResearch/OLMo-Bitnet-1B with Docker Model Runner:
docker model run hf.co/NousResearch/OLMo-Bitnet-1B
Is it bitnet {-1,0,1}?
I looked through many BitNet b1.58 implementations and noticed that they all use the method suggested in "The Era of 1-bit LLMs: Training Tips, Code and FAQ". The weights of the models currently trained with this recipe are not numbers in the set {-1, 0, 1} but values in the interval (0, 1). Is this the way it should be?
- The formula describing the quantization of weights ("The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits").
- Implementation proposal ("The Era of 1-bit LLMs: Training Tips, Code and FAQ").
- Weight quantization test.
- Model during training.
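The formula in the first item is the absmean scheme from the BitNet b1.58 paper. Each weight matrix is divided by its mean absolute value γ, then rounded and clipped to {-1, 0, 1}:

$$\widetilde{W} = \mathrm{RoundClip}\left(\frac{W}{\gamma + \epsilon}, -1, 1\right), \quad \mathrm{RoundClip}(x, a, b) = \max\big(a, \min(b, \mathrm{round}(x))\big), \quad \gamma = \frac{1}{nm}\sum_{ij} \lvert W_{ij} \rvert$$

Below is a minimal pure-Python sketch of that scheme. The function names are hypothetical, and the sketch assumes a single per-tensor scale as in the paper. It illustrates one plausible reading of the observation above. In the training recipe, full-precision latent weights are kept and quantized on the fly in the forward pass through a straight-through estimator. If the quantized weight is multiplied back by γ, it takes values in {-γ, 0, +γ} rather than {-1, 0, 1}.

```python
# Sketch of BitNet b1.58 absmean weight quantization (pure Python, no torch).
# Assumption: one scale gamma per weight tensor, as in the b1.58 paper.


def absmean_ternary(weights, eps=1e-5):
    """Return (codes, gamma): codes are in {-1, 0, 1}, gamma = mean(|w|)."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    # RoundClip(w / (gamma + eps), -1, 1); round() on a float returns an int.
    codes = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return codes, gamma


def fake_quantize(weights, eps=1e-5):
    """Rescale the ternary codes by gamma, giving values in {-gamma, 0, +gamma}.

    This corresponds to the dequantized weight seen in the forward pass
    during training; it is not itself a set of {-1, 0, 1} values.
    """
    codes, gamma = absmean_ternary(weights, eps)
    return [c * gamma for c in codes]


latent = [0.4, -0.05, 0.1, -0.25]  # hypothetical full-precision latent weights
codes, gamma = absmean_ternary(latent)
print(codes)  # [1, 0, 0, -1]
print(gamma)  # per-tensor scale, about 0.2
print(fake_quantize(latent))
```

Under this assumption, a checkpoint saved during training would hold the full-precision latent weights. Recovering the ternary codes would require applying a step like `absmean_ternary` to each weight tensor.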
Sadly, no. It's fp16. Honestly, I don't understand the reason for training in fp16: why isn't the research carried forward from where the paper left off? Why not train another 1-bit model with more parameters, more training data, or for longer, or better yet, with a good combination of these? The paper already showed that 1-bit models are a good contender against fp models (and int quants), so why bother with anything else? Anyway, I hope someone can carry this research forward without needing Google resources (pun intended).
