---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- text-generation
- openlm
- silo
---

# Silo Language Models: Isolating Legal Risk in a Datastore

This is Silo-PD, first introduced in [Silo Language Models](https://arxiv.org/abs/2308.04430) by researchers at the University of Washington, UC Berkeley, and the Allen Institute for AI.

### NOTE: Dependencies

To use the model, you need to install a specific `transformers` fork:

```
pip install git+https://github.com/kernelmachine/transformers@openlm#egg=transformers
```

The model also depends on `xformers`, which you can install via:

```
pip install xformers
```

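Optionally, you can verify that both dependencies import cleanly (this snippet is just a sanity check, not part of the required setup):

```python
# Optional sanity check: confirm the transformers fork and xformers are importable.
import transformers
import xformers

print(transformers.__version__, xformers.__version__)
```
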
### Model Description

Silo-PD is a 1.3B-parameter, decoder-only language model trained on public-domain data from [the Open License Corpus (OLC)](https://huggingface.co/datasets/kernelmachine/open-license-corpus).

The model is based on the LLaMA architecture as implemented in [OpenLM](https://github.com/mlfoundations/open_lm).

The model was trained on 128 A100 GPUs across 16 nodes.

### Model and Training Hyperparameters

We follow the model architecture of LLaMA, and we use the GPT-NeoX-20B tokenizer, with 50,432 BPE types.

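For reference, you can load the tokenizer from this repository and inspect its size (a minimal sketch; it assumes the tokenizer files ship with the checkpoint, as the usage examples below imply):

```python
from transformers import AutoTokenizer

# Load the GPT-NeoX-20B-style BPE tokenizer bundled with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("kernelmachine/silo-pd-1.3b")
# The card reports 50,432 BPE types; the exact count may differ slightly
# if the embedding matrix is padded beyond the tokenizer vocabulary.
print(len(tokenizer))
```
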
During training, we use 2,048-token sequences that are packed across document boundaries, and we prepend a beginning-of-text token to every document.

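A minimal sketch of this packing scheme (illustrative only, not the actual training code; the tokenizer interface and `bos_id` are assumptions):

```python
# Illustrative sketch: prepend a beginning-of-text token to each document,
# concatenate everything into one token stream, and slice it into
# fixed-length 2,048-token training sequences that ignore document boundaries.
def pack_documents(documents, tokenizer, bos_id, seq_len=2048):
    stream = []
    for doc in documents:
        stream.append(bos_id)                 # beginning-of-text marker
        stream.extend(tokenizer.encode(doc))  # document tokens
    return [stream[i:i + seq_len] for i in range(0, len(stream) - seq_len + 1, seq_len)]
```
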
We use the Adam optimizer with a weight decay of 0.1 and beta_2 of 0.95, with 2,000 warmup steps and a cosine learning rate schedule.

| Model | Layers | Heads | d_model | LR   | Batch size (tokens) |
|-------|--------|-------|---------|------|---------------------|
| 1.3B  | 24     | 16    | 2048    | 1e-3 | 2.6M                |

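For illustration, these optimization settings map onto PyTorch as follows (a sketch, not the original OpenLM training code; beta_1 = 0.9, the decoupled AdamW variant, and the total step count are assumptions):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # placeholder for the 1.3B OpenLM model

# Weight decay 0.1 and beta_2 = 0.95, per the card;
# beta_1 = 0.9 and the decoupled AdamW variant are assumptions.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.95), weight_decay=0.1)

# 2,000 warmup steps, then cosine decay. The total step count is an estimate:
# 60B training tokens / 2.6M tokens per batch is roughly 23,000 steps.
scheduler = get_cosine_schedule_with_warmup(optimizer,
                                            num_warmup_steps=2_000,
                                            num_training_steps=23_000)
```
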
### Training data

Silo-PD was trained on the following domain proportions (please see the OLC repository for more details on the data sources for each domain):

| Domain  | Tokens (B) | %     |
|---------|------------|-------|
| Legal   | 27.1       | 86.2  |
| Books   | 2.9        | 9.3   |
| Science | 1.2        | 3.8   |
| News    | 0.2        | 0.7   |
| Total   | 31.4       | 100.0 |

We train with early stopping for 60B tokens in total, i.e., roughly two epochs over this subset.

Since the domain distribution of OLC is highly skewed, we apply a simple upweighting scheme: we upsample all data that accounts for less than 5% of the corpus by a factor of 3x, which we found to work well after a sweep of different settings.

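The effect of this scheme on the domain mix can be sketched directly from the table above (illustrative only; the training pipeline may apply the rule at a finer, per-source granularity):

```python
# Upsample every domain holding less than 5% of the corpus by 3x,
# then renormalize into sampling proportions (illustrative sketch).
token_counts = {"legal": 27.1, "books": 2.9, "science": 1.2, "news": 0.2}  # billions

total = sum(token_counts.values())
upweighted = {d: c * 3 if c / total < 0.05 else c for d, c in token_counts.items()}

norm = sum(upweighted.values())
proportions = {d: round(w / norm, 3) for d, w in upweighted.items()}
print(proportions)
# {'legal': 0.792, 'books': 0.085, 'science': 0.105, 'news': 0.018}
```
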
### Intended Uses and Limitations

This model can be used for prompting-based evaluation of downstream tasks as well as for text generation.

### How to use

You can use this model directly with a pipeline for text generation:

```python
from transformers import pipeline

generator = pipeline('text-generation', model="kernelmachine/silo-pd-1.3b", device='cuda')
generator("Hello")
# [{'generated_text': 'Hello, my dear," said the old man, "I have been waiting for you\na long'}]
```

By default, generation is deterministic (greedy). To use top-k sampling, set `do_sample=True`:

```python
from transformers import pipeline, set_seed

set_seed(42)
generator = pipeline('text-generation', model="kernelmachine/silo-pd-1.3b", device='cuda', do_sample=True)
generator("Hello")
# [{'generated_text': 'Hello, Mother," he called.\n\n"Hello, Son. Have you got a car'}]
```

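Sampling parameters can also be passed per call; the values below are illustrative, not recommendations:

```python
# Continues the example above; top_k, temperature, and max_new_tokens
# are standard generate() arguments forwarded by the pipeline.
generator("Hello", do_sample=True, top_k=50, temperature=0.9, max_new_tokens=30)
```
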
### Limitations and Bias

Silo-PD inherits the biases and limitations of public-domain data: because copyright-expired text is predominantly older, the model carries risks of toxic or otherwise unfair output.

Silo-PD may also output personally identifiable information, because we did not filter PII out of the training data.