SFT Llama-3 8b on Colab
Thanks to Unsloth and the TRL SFTTrainer, fine-tuning Llama-3 is much easier.
# .......................
# model/tokenizer loading and prompt setup (elided; see the sketch below)
# .......................
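The elided block loads the base model, attaches LoRA adapters, and defines the alpaca_prompt template and formatting_prompts_func used by the dataset code below. Here is a minimal sketch of that setup, assuming the standard Unsloth Alpaca-style recipe; the checkpoint name and LoRA hyperparameters are my assumptions, chosen so that the trainable-parameter count matches the 41,943,040 reported in the training log further down.

from unsloth import FastLanguageModel

max_seq_length = 2048  # assumption; pick any length you need

# Load a 4-bit quantized Llama-3 8B base model (checkpoint name is an assumption).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,        # auto-detect: bfloat16 on Ampere+, float16 otherwise
    load_in_4bit = True,
)

# Attach LoRA adapters. r = 16 on all attention/MLP projections of the 32 layers
# gives 41,943,040 trainable parameters, matching the log below.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

# Alpaca-style prompt template (matches the generations shown later).
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # must be appended, or generation never stops

def formatting_prompts_func(examples):
    texts = [alpaca_prompt.format(ins, inp, out) + EOS_TOKEN
             for ins, inp, out in zip(examples["instruction"],
                                      examples["input"],
                                      examples["output"])]
    return {"text": texts}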
from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
import torch
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
Now start training:

trainer_stats = trainer.train()
Unsloth - 2x faster free finetuning | Num GPUs = 1
Num examples = 51,760 | Num Epochs = 1
Batch size per device = 2 | Gradient Accumulation steps = 4
Total batch size = 8 | Total steps = 60
Number of trainable parameters = 41,943,040
[60/60 07:14, Epoch 0/1]

Step | Training Loss |
---|---|
1 | 1.823800 |
2 | 2.305700 |
3 | 1.692200 |
4 | 1.947700 |
5 | 1.644100 |
6 | 1.606800 |
7 | 1.193300 |
8 | 1.259400 |
9 | 1.109100 |
10 | 1.165600 |
11 | 0.965300 |
12 | 1.003100 |
13 | 0.935100 |
14 | 1.060900 |
15 | 0.909700 |
16 | 0.911000 |
17 | 1.023800 |
18 | 1.285900 |
... | ... |
49 | 0.915200 |
50 | 1.049000 |
51 | 1.025600 |
52 | 0.927200 |
53 | 1.008400 |
54 | 1.169600 |
55 | 0.799000 |
56 | 1.029200 |
57 | 0.887200 |
58 | 0.830100 |
59 | 0.862400 |
60 | 0.905500 |
Training statistics:

- 453.9762 seconds (7.57 minutes) used for training.
- Peak reserved memory = 7.529 GB (51.051 % of max memory).
- Peak reserved memory for training = 1.935 GB (13.12 % of max memory).
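These numbers come from trainer_stats and the CUDA memory counters. A minimal sketch of how they are typically computed (variable names are illustrative, and it assumes a single CUDA GPU; the "for training" share would subtract the same counter read before trainer.train()):

import torch

# Peak reserved memory after training, in GB, and the device's total memory.
used_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
max_memory = round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 3)

print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory % of max memory = {round(used_memory / max_memory * 100, 3)} %.")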
Test the fine-tuned model:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Continue the fibonnaci sequence.", # instruction
            "1, 1, 2, 3, 5, 8", # input
            "", # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)
Results:
Setting `pad_token_id` to `eos_token_id`: 128001 for open-end generation.
['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nContinue the fibonnaci sequence.\n\n### Input:\n 1, 1, 2, 3, 5, 8\n\n### Response:\n 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025']
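Note that batch_decode returns the full prompt plus the completion. If you only want the generated answer, a small illustrative post-processing step like the following is usually enough, assuming the Alpaca template above:

# Illustrative helper: keep only the text generated after "### Response:"
decoded = tokenizer.batch_decode(outputs, skip_special_tokens = True)[0]
response = decoded.split("### Response:")[-1].strip()
print(response)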
Then, you can save the model!

- Local Saving: Use model.save_pretrained("lora_model") to save locally.
- Online Saving: Use model.push_to_hub("your_name/lora_model", token = "...") to save online to Hugging Face's model hub.
- Saving to Float16: Use model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",) for saving in float16 format.
- Saving to 4 bit: Use model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",) for saving in 4 bit format.
To save to GGUF / llama.cpp, you can also use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to HF.
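A minimal sketch of the GGUF export, assuming Unsloth's save_pretrained_gguf / push_to_hub_gguf methods; the repo name, quantization method (q4_k_m), and token here are placeholders for your own values:

# Save a GGUF file locally (quantization_method is passed through to llama.cpp).
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

# Or upload the GGUF file to the Hugging Face Hub.
model.push_to_hub_gguf("your_name/lora_model_gguf", tokenizer,
                       quantization_method = "q4_k_m", token = "...")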
And one more question.
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Answer the thought-provoking question.", # instruction
            "Is China the West 'North Korea'?", # input
            "", # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 256)
Setting `pad_token_id` to `eos_token_id`: 128001 for open-end generation.

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Answer the thought-provoking question.

### Input:
Is China the West 'North Korea'?

### Response:
China is not the West's "North Korea." While China and North Korea share some similarities, such as being communist countries, they are two distinct nations with different histories, cultures, and political systems. China is a much larger and more powerful country than North Korea, with a population of over 1.4 billion people and a GDP of over $14 trillion. North Korea, on the other hand, has a population of around 25 million people and a GDP of around $40 billion. China is also a member of the United Nations and has a seat on the UN Security Council, while North Korea is not a member of the UN and has been subject to international sanctions for its nuclear weapons program. In conclusion, while China and North Korea may share some similarities, they are two separate countries with different political systems, economies, and international standing.<|end_of_text|>
Why so serious?