Context length of the model?

#30

by shipWr3ck - opened Jul 16, 2024

Discussion

shipWr3ck

Jul 16, 2024

I did not find this on the description page: what is the max context length of the model?

IgorSiu

Jul 16, 2024

Test

mclassHF2023

Jul 19, 2024

It's in the config.json:
"max_position_embeddings": 8192,
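
A minimal sketch of how the value quoted above could be read and used as a token budget. It assumes you have the config text locally; `CONFIG_TEXT` is a stand-in holding only the field quoted in this thread, and `context_length` and `fits_in_context` are hypothetical helpers, not part of any library.

```python
import json

# Stand-in for the model's config.json: only the field quoted in this thread.
CONFIG_TEXT = '{"max_position_embeddings": 8192}'


def context_length(config_text: str) -> int:
    """Return the maximum number of positions declared in a config.json string."""
    return int(json.loads(config_text)["max_position_embeddings"])


def fits_in_context(prompt_tokens: int, max_new_tokens: int, limit: int) -> bool:
    """True if the prompt plus the requested generation stays within the limit."""
    return prompt_tokens + max_new_tokens <= limit


limit = context_length(CONFIG_TEXT)
print(limit)
print(fits_in_context(8000, 40, limit))
print(fits_in_context(8000, 400, limit))
```

The check counts both the prompt and the tokens still to be generated, since both occupy positions in the same window.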

lkv

Google org Aug 8, 2024

Hi @shipWr3ck, you can find the context window size, which is 8192, in the model's config.json. Kindly go through the config.json file and let me know if you have any questions. Thank you.

lunahr

Aug 13, 2024

as with all gemma 2 models, it's 8192.

however they are also compatible with SelfExtend to extend the context to 32768 without degradation.

you need a lot of GPU memory to do this though, A100 or better only.
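
As a rough illustration of where a figure like 32768 could come from: the SelfExtend paper keeps ordinary attention inside a neighbor window and uses grouped positions beyond it, which it states gives an extended window of (L - w_n) * G + w_n for a pretrained length L. The group size and neighbor window below are assumptions chosen so the arithmetic lands on 32768. They are not settings confirmed anywhere in this thread, and `self_extend_max_length` is a hypothetical helper.

```python
def self_extend_max_length(pretrained_len: int, group_size: int, neighbor_window: int) -> int:
    """Extended context window from the SelfExtend formula (L - w_n) * G + w_n."""
    return (pretrained_len - neighbor_window) * group_size + neighbor_window


# Assumed settings, for illustration only.
PRETRAINED_LEN = 8192   # from config.json, as quoted in this thread
GROUP_SIZE = 5          # assumption
NEIGHBOR_WINDOW = 2048  # assumption

print(self_extend_max_length(PRETRAINED_LEN, GROUP_SIZE, NEIGHBOR_WINDOW))
```

Other combinations reach the same total, for example a group size of 4 with no neighbor window, so the target length alone does not pin down the settings.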

lkv

Google org Aug 27, 2024

This comment has been hidden

shipWr3ck changed discussion status to closed Aug 31, 2024

thusinh1969

Dec 14, 2024

I wish the next Gemma release were compatible with at least an 80k context length (it should fit nicely on an H100 80G, though). 8192 is rather a lab setting, useless in practice.
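
For a sense of scale on the memory side, here is a back-of-the-envelope sketch of key/value cache size as a function of context length. The layer count, KV head count, and head dimension are assumptions believed to match the published config for this model; check them against config.json before relying on the numbers. The estimate covers the cache only: it ignores the weights, activations, framework overhead, and any savings from sliding-window layers.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence.

    The leading factor 2 accounts for one key and one value vector
    per layer, per KV head, per token. bytes_per_value=2 assumes
    16-bit storage (for example bfloat16).
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed architecture values; verify against the model's config.json.
LAYERS, KV_HEADS, HEAD_DIM = 42, 8, 256

for tokens in (8192, 80_000):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens} tokens: {gib:.1f} GiB of KV cache")
```

The cost grows linearly with the number of tokens, so the same function can be used to compare any two context lengths.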

lunahr

Dec 14, 2024

edited Dec 14, 2024

> I wish the next Gemma release were compatible with at least an 80k context length (it should fit nicely on an H100 80G, though). 8192 is rather a lab setting, useless in practice.

If you are hoping for a longer-context model in the Gemma series, you may want to stick around for Gemma 3. Maybe Google will make it a 128k model, as Llama 3.1 and later already are.
