Optimizing Models: Fine-Tuning, RAG and Application Strategies

HadilBENAmor · ‎Sep 19 2024

Before diving in, let's briefly review the key resources and foundational concepts used throughout this blog, so that the steps ahead are easy to follow.

Microsoft Azure: Microsoft offers a cloud computing platform and a suite of cloud services. It provides a wide range of cloud-based services and solutions that enable organizations to build, deploy, and manage applications and services through Microsoft's global network of data centers.
AI Studio: a platform that helps you evaluate model responses and orchestrate prompt application components with prompt flow for better performance. It makes it easier to scale proofs of concept into full production, and its continuous monitoring and refinement support long-term success.

Fine-tuning: the process of retraining pretrained models on specific datasets. The purpose is typically to improve model performance on specific tasks or to introduce information that wasn't well represented when the base model was originally trained.

Retrieval Augmented Generation (RAG): a pattern that works with pretrained large language models (LLMs) and your own data to generate responses. In Azure Machine Learning, you can implement RAG in a prompt flow.

Our hands-on project is an AI-based solution that helps the user extract financial information and insights from the investment and finance books and newspapers in our database.

The process is divided into three main parts:

Fine-tune a base model with financial data so that its responses are more specific and grounded in data related to finance and investment.
Implement RAG so that the response is based not only on the data the model was trained (fine-tuned) with but also on other data sources (the user’s input in our case).
Integrate the deployed model into a web app so that it can be used through a user interface.

1- Setup:

Create a resource group: a resource group is a container that holds related resources for an Azure solution. It can include all the resources for the solution, or only those resources that you want to manage as a group.
You need to specify your subscription, a unique resource group name, and the region.
Create an Azure OpenAI resource: Azure OpenAI Service provides REST API access to OpenAI's powerful language models, including the GPT-4o, GPT-4 Turbo with Vision, GPT-4, GPT-3.5-Turbo, and Embeddings model series. These models can be easily adapted to your specific task, including but not limited to content generation, summarization, image understanding, semantic search, and natural language to code translation.

Note: If you plan to deploy or fine-tune a specific model, check in which regions that model is available and create your Azure OpenAI resource in one of them.
- Create a text embedding model: an embedding is an information-dense representation of the semantic meaning of a piece of text. Each embedding is a vector of floating-point numbers, such that the distance between two embeddings in the vector space is correlated with the semantic similarity between the two inputs in their original format.
Create an AI Search resource: Azure AI Search (previously "Azure Cognitive Search") provides secure information retrieval at scale over user-owned content in traditional and generative AI search applications. Information retrieval is foundational to any app that surfaces text and vectors. Common scenarios include data exploration and, increasingly, feeding query results to prompts based on your proprietary grounding data for conversational search, as we will do in our example.
Create a storage account: it contains all your Azure Storage data objects: blobs, files, queues, and tables. The storage account provides a unique namespace for your Azure Storage data that is accessible from anywhere in the world over HTTP or HTTPS.

Note: Locally redundant storage (LRS) replicates your storage account three times within a single data center in the primary region. LRS provides at least 99.999999999% (11 nines) durability of objects over a given year. LRS is the lowest-cost redundancy option and offers the least durability compared to other options. For an Azure for Students subscription, for example, this choice is the most cost-effective.
- Create a blob container: Blob Storage is Microsoft's object storage solution, optimized for storing massive amounts of unstructured data. Unstructured data is data that doesn't adhere to a particular data model or definition, such as text or binary data. The container will be used to store your data.
Navigate to your storage resource -> Click on the Storage browser tab on the left -> Click Blob containers -> Click on + Add container, then upload your data. Our data consisted of PDF files (books and newspapers) and CSV files from Kaggle, all related to finance and investment.
Create a search index: an index is your searchable content, available to the search engine for indexing, full text search, vector search, hybrid search, and filtered queries. Check that the status of your AI Search service is "Running".
- Import and vectorize data: integrated vectorization is an extension of the indexing and query pipelines in Azure AI Search. It adds two capabilities during indexing: data chunking (splitting the data into smaller, manageable pieces) and text-to-vector conversion.
Navigate to your AI Search service -> Click on the Indexes tab on the left -> Click on "Import and vectorize data" -> Select the text embedding model you deployed previously.
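
To make the chunking and text-to-vector steps more concrete, here is a minimal, self-contained sketch. It does not call Azure AI Search or a real embedding model: `toy_embed` is a hypothetical stand-in that hashes words into a small vector, and `chunk_text`, `cosine_similarity`, and `top_chunk` are illustrative names of our own. The point is only to show why closer vectors are treated as more similar text.

```python
import math
import zlib


def chunk_text(text, max_words=8, overlap=2):
    """Split text into overlapping word windows (one simple chunking scheme)."""
    if max_words <= 0 or overlap < 0 or overlap >= max_words:
        raise ValueError("need max_words > overlap >= 0")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text, dims=64):
    """Hypothetical stand-in for an embedding model: a hashed bag of words."""
    vector = [0.0] * dims
    for word in text.lower().split():
        token = word.strip(".,;:!?")
        if token:
            vector[zlib.crc32(token.encode("utf-8")) % dims] += 1.0
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunk(query, chunks):
    """Return the chunk whose toy vector is closest to the query's vector."""
    query_vec = toy_embed(query)
    return max(chunks, key=lambda c: cosine_similarity(query_vec, toy_embed(c)))


if __name__ == "__main__":
    text = ("Diversification spreads risk across assets. "
            "Bonds usually pay fixed interest. "
            "Stocks represent ownership in a company.")
    pieces = chunk_text(text, max_words=6, overlap=1)
    print(top_chunk("What do bonds pay?", pieces))
```

A real embedding model captures meaning rather than shared words, so the toy vectors here only illustrate the mechanics of comparing a query vector with chunk vectors.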

2- Fine Tune a Base Model

Preparing the dataset: before starting fine-tuning, we need to prepare our training and validation data, which can be collected manually or generated from text files or tabular data. It is important that the data conform to the following template:

{"messages": [{"role": "system", "content": "content goes here"}, {"role": "user", "content": "query goes here?"}, {"role": "assistant", "content": "response goes here."}]}.

To satisfy this, we prepared our two data sets training_data.jsonl and val_data.jsonl for training and validation, respectively.
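
Before uploading, it can help to check every line against the template. The sketch below is one way to do that with the standard library only. The function name `validate_example` and the set `ALLOWED_ROLES` are our own, and the allowed roles are an assumption taken from the three roles shown in the template above.

```python
import json

# Assumption: only the three roles shown in the template are expected.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]
    if not isinstance(example, dict):
        return ["top-level value is not a JSON object"]
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {role!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i} has no string 'content'")
    # A training example needs at least one assistant reply to learn from.
    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        problems.append("no assistant message to learn from")
    return problems
```

Calling `validate_example` on each line of training_data.jsonl and val_data.jsonl and printing any non-empty result gives a quick report of malformed examples.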

Note: JSONL is a text-based format that uses the .jsonl file extension. It is essentially JSON, with newline characters separating the JSON values. It is also known as JSON Lines.
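
As a small illustration of the format, the sketch below writes a list of examples as one JSON value per line and reads them back. The helper names `write_jsonl` and `read_jsonl` are hypothetical, and an in-memory `io.StringIO` buffer stands in for a real .jsonl file.

```python
import io
import json


def write_jsonl(examples, stream):
    """Write each example as a single line of JSON followed by a newline."""
    for example in examples:
        stream.write(json.dumps(example, ensure_ascii=False) + "\n")


def read_jsonl(stream):
    """Parse one JSON value per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]


if __name__ == "__main__":
    examples = [
        {"messages": [
            {"role": "system", "content": "You are a helper in finance and an advisor in investment"},
            {"role": "user", "content": "What makes a successful startup?"},
            {"role": "assistant", "content": "Luck, hard work, and maybe skills... mostly luck, hard work, and consistency."},
        ]},
    ]
    buffer = io.StringIO()
    write_jsonl(examples, buffer)
    buffer.seek(0)
    print(read_jsonl(buffer) == examples)
```

Because `json.dumps` escapes newline characters inside strings, each example always occupies exactly one line, which is what makes the line-by-line reader safe.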

It is recommended to have at least 50 high-quality examples to start out. However, it's entirely possible to have a use case that might require thousands of high-quality training examples to be successful. Low-quality data will affect the model’s performance.

Below is an example of our training_data.jsonl:

{"messages": [{"role": "system", "content": "You are a helper in finance and an advisor in investment"}, {"role": "user", "content": "What makes a successful startup?"}, {"role": "assistant", "content": "Luck, hard work, and maybe skills... mostly luck, hard work, and consistency."}]}

Both data files are attached to this blog. They were assembled manually from a set of examples.

Evaluate the data to ensure its quality by checking the number of tokens per example and their distribution.

import json
import tiktoken
import numpy as np
from collections import defaultdict

encoding = tiktoken.get_encoding("cl100k_base")

def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        if not isinstance(message, dict):
            print(f"Unexpected message format: {message}")
            continue
        num_tokens += tokens_per_message
        for key, value in message.items():
            if not isinstance(value, str):
                print(f"Unexpected value type for key '{key}': {value}")
                continue
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if not isinstance(message, dict):
            print(f"Unexpected message format: {message}")
            continue
        if message.get("role") == "assistant":
            content = message.get("content", "")
            if not isinstance(content, str):
                print(f"Unexpected content type: {content}")
                continue
            num_tokens += len(encoding.encode(content))
    return num_tokens

def print_distribution(values, name):
    if values:
        print(f"\n#### Distribution of {name}:")
        print(f"min / max: {min(values)}, {max(values)}")
        print(f"mean / median: {np.mean(values)}, {np.median(values)}")
        print(f"p5 / p95: {np.quantile(values, 0.05)}, {np.quantile(values, 0.95)}")
    else:
        print(f"No values to display for {name}")

# File names match the datasets prepared above
files = [
    r'training_data.jsonl',
    r'val_data.jsonl'
]

for file in files:
    print(f"Processing file: {file}")
    try:
        with open(file, 'r', encoding='utf-8') as f:
            total_tokens = []
            assistant_tokens = []
            for line in f:
                try:
                    ex = json.loads(line)
                    messages = ex.get("messages", [])
                    if not isinstance(messages, list):
                        raise ValueError("The 'messages' field should be a list.")
                    total_tokens.append(num_tokens_from_messages(messages))
                    assistant_tokens.append(num_assistant_tokens_from_messages(messages))
                except json.JSONDecodeError:
                    print(f"Error decoding JSON line: {line}")
                except ValueError as ve:
                    print(f"ValueError: {ve} - line: {line}")
                except Exception as e:
                    print(f"Unexpected error processing line: {e} - line: {line}")

            if total_tokens and assistant_tokens:
                print_distribution(total_tokens, "total tokens")
                print_distribution(assistant_tokens, "assistant tokens")
            else:
                print("No valid data to process.")

            print('*' * 50)

    except FileNotFoundError:
        print(f"File not found: {file}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

Log in to AI Studio
Navigate to the Fine-tuning tab
Check the available models for fine-tuning within your region.
Note: Make sure you have enough quota available. Without enough quota, the model may become unavailable once the token limit is exceeded, and responses may be slower (higher latency).
Upload your training and validation data

Since our data was stored locally, we uploaded the files directly. If you want to store your data in the cloud and refer to it later instead of using the "Uploading files" option, you can use the SDK as follows:

from openai import AzureOpenAI

# azure_oai_endpoint, azure_oai_key, and version are assumed to be defined
# earlier, for example loaded from environment variables.

# Initialize the AzureOpenAI client
client = AzureOpenAI(
    azure_endpoint=azure_oai_endpoint,
    api_key=azure_oai_key,
    api_version=version  # Ensure this API version is correct
)

training_file_name = r'path'
validation_file_name = r'path'

try:
    # Upload the training dataset file
    with open(training_file_name, "rb") as file:
        training_response = client.files.create(
            file=file, purpose="fine-tune"
        )
    training_file_id = training_response.id
    print("Training file ID:", training_file_id)
except Exception as e:
    print(f"Error uploading training file: {e}")

try:
    # Upload the validation dataset file
    with open(validation_file_name, "rb") as file:
        validation_response = client.files.create(
            file=file, purpose="fine-tune"
        )
    validation_file_id = validation_response.id
    print("Validation file ID:", validation_file_id)
except Exception as e:
    print(f"Error uploading validation file: {e}")

You can specify hyperparameters such as the batch size, or leave them at their default values.
Review the settings before submitting.

Check the status of the fine-tuning job in your dashboard; it changes from Queued to Running to Completed.
Once completed, your fine-tuned model is ready to be deployed. Click on ‘Deploy’.
After successful deployment, you can go back to Azure OpenAI and find your fine-tuned model deployed along with your previous text embedding model.
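
If you uploaded the files with the SDK, the job itself can also be started and monitored from code instead of the dashboard. The following is a sketch under assumptions, not the exact steps we used: `start_fine_tuning` and `wait_for_job` are hypothetical helper names, `client` is an already configured `AzureOpenAI` client, and the base model name is whatever is available for fine-tuning in your region. The status strings in `TERMINAL_STATES` are the ones we would expect the API to report for a finished job; the dashboard labels (Queued, Running, Completed) may not match them one-to-one, so verify them against your own job.

```python
import time

# Assumption: statuses that mean the job will not change any further.
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def start_fine_tuning(client, training_file_id, validation_file_id,
                      base_model, batch_size=None):
    """Create a fine-tuning job; hyperparameters are sent only when given."""
    kwargs = {
        "training_file": training_file_id,
        "validation_file": validation_file_id,
        "model": base_model,
    }
    if batch_size is not None:
        kwargs["hyperparameters"] = {"batch_size": batch_size}
    return client.fine_tuning.jobs.create(**kwargs)


def wait_for_job(client, job_id, poll_seconds=30, sleep=time.sleep):
    """Poll the job until it reaches a terminal state, then return it."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in TERMINAL_STATES:
            return job
        sleep(poll_seconds)
```

Passing `sleep` as a parameter keeps the polling loop easy to test without waiting; in normal use the default `time.sleep` is fine.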

3- Integration into Web App

The concept here is to rely on both the model's knowledge and the user's documents. We have two options, and both aim for precise responses:

Look for the answer in the documents, and if not found, return a response based on the internal knowledge of the model.
Combine the two responses from the retriever and the model. This is the option we chose here.
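
The difference between the two options can be sketched in a few lines. This is only an illustration of the control flow, not how the deployed app implements it: `answer_from_documents` and `answer_from_model` are hypothetical callables standing in for the retriever and the fine-tuned model, and the way the two answers are joined is an arbitrary choice.

```python
def answer_with_fallback(question, answer_from_documents, answer_from_model):
    """Option 1: prefer the documents, fall back to the model's own knowledge."""
    doc_answer = answer_from_documents(question)
    if doc_answer is not None:
        return doc_answer
    return answer_from_model(question)


def answer_combined(question, answer_from_documents, answer_from_model):
    """Option 2: always use the model's answer and add document findings."""
    model_answer = answer_from_model(question)
    doc_answer = answer_from_documents(question)
    if doc_answer is None:
        return model_answer
    return f"{model_answer}\n\nFrom your documents: {doc_answer}"


if __name__ == "__main__":
    # Stub retriever and model, for illustration only.
    docs = lambda q: "Bonds pay fixed interest." if "bond" in q else None
    model = lambda q: "A general answer from the model."
    print(answer_with_fallback("What is a bond?", docs, model))
    print(answer_combined("What is a bond?", docs, model))
```

In the playground-based setup described later, a similar effect is obtained through the system message rather than through explicit branching like this.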

For integration, there are also two approaches: use the Azure OpenAI user interface and deploy to an Azure static web app, or develop your own web app and use the Azure SDK to integrate your model.

1- Deploying into Azure static web app

Click on "Open in Playground" below your deployments list in Azure OpenAI
Click "Add your data"
Choose your Azure blob storage as data source -> Choose Index name "myindex"
Customize the system message to "You are a financial advisor and an expert in investment. You have access to a wide variety of documents. Use your own knowledge to answer the question and verify it or supplement it using the relevant documents when possible." This system message will enable the model not only to rely on documents but also rely on its internal knowledge.
Complete the setup and click on "Apply changes"
Deploy to a new web app and configure the web app name, subscription, resource group, location, and pricing plan.

2- Develop your own web app and use the Azure SDK

Prepare your environment

import os
from dotenv import load_dotenv

# Load the settings from a local .env file into the environment
load_dotenv()

azure_oai_endpoint = os.getenv("AZURE_OAI_FINETUNE_ENDPOINT2")
azure_oai_key = os.getenv("AZURE_OAI_FINETUNE_KEY2")
azure_oai_deployment = os.getenv("AZURE_OAI_FINETUNE_DEPLOYMENT2")
azure_search_endpoint = os.getenv("AZURE_SEARCH_ENDPOINT")
azure_search_key = os.getenv("AZURE_SEARCH_KEY")
azure_search_index = os.getenv("AZURE_SEARCH_INDEX")
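
Since `os.getenv` silently returns `None` for a missing variable, a small check up front can save debugging time later. The sketch below is one possible way to do it; `REQUIRED_SETTINGS` and `load_settings` are hypothetical names, and the variable names are simply the ones used above.

```python
import os

# The environment variable names used in this post.
REQUIRED_SETTINGS = (
    "AZURE_OAI_FINETUNE_ENDPOINT2",
    "AZURE_OAI_FINETUNE_KEY2",
    "AZURE_OAI_FINETUNE_DEPLOYMENT2",
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_KEY",
    "AZURE_SEARCH_INDEX",
)


def load_settings(env=None):
    """Return the required settings, or raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SETTINGS}
```

Calling `load_settings()` right after `load_dotenv()` fails fast with the list of missing names instead of producing a confusing error from the client later on.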

Initialize your AzureOpenAI client

from openai import AzureOpenAI

# The /extensions path targets the "on your data" preview endpoint
client = AzureOpenAI(
    base_url=f"{azure_oai_endpoint}/openai/deployments/{azure_oai_deployment}/extensions",
    api_key=azure_oai_key,
    api_version="2023-09-01-preview")

Configure your data source for Azure AI Search. This lets the model retrieve content from our stored files.

extension_config = dict(
    dataSources=[
        {
            "type": "AzureCognitiveSearch",
            "parameters": {
                "endpoint": azure_search_endpoint,
                "key": azure_search_key,
                "indexName": azure_search_index,
            }
        }
    ]
)
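
With the client and the data source configured, the remaining step is to send the user's question together with the search configuration. The sketch below shows one way this could look; `ask_with_search` is a hypothetical helper, the system message is the one used in the playground setup, and passing the configuration through `extra_body` is an assumption that matches the preview API version used above (newer API versions may expect a different request shape).

```python
SYSTEM_MESSAGE = (
    "You are a financial advisor and an expert in investment. "
    "You have access to a wide variety of documents. "
    "Use your own knowledge to answer the question and verify it or "
    "supplement it using the relevant documents when possible."
)


def ask_with_search(client, deployment, question, extension_config,
                    system_message=SYSTEM_MESSAGE):
    """Send one question to the deployed model with the search data source."""
    response = client.chat.completions.create(
        model=deployment,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
        # Extra JSON fields merged into the request body (the data source).
        extra_body=extension_config,
    )
    return response.choices[0].message.content
```

In a web app, a route handler would call something like `ask_with_search(client, azure_oai_deployment, user_text, extension_config)` and return the resulting text to the page.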

Note: When you implement RAG, make sure that the chat response does not rely solely on it. You may get a response such as “This information is not available in your data source,” which indicates that the model is relying only on the search over your data and is not drawing on the data it was trained with to generate a suitable response.

RAG is used to enhance a model's capabilities by adding more grounded information, not to eliminate the model’s internal knowledge.

Some issues that you may face during development:

Issue 1: make sure to verify the version of the openai Python package. You can pin the version to openai==0.28, or upgrade it and follow the migration steps. The AzureOpenAI client used in this post belongs to the 1.x package.
Issue 2: you may run out of quota and be asked to wait 24 hours before the next try. Make sure to always have enough quota in your subscription.
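
For the first issue, a quick check of the installed package version tells you which interface your code should target. This is a small sketch using only the standard library; the function names are our own, and the only assumption is that the major version number separates the older 0.x interface from the 1.x one.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading number of a version string such as '0.28.1'."""
    return int(version_string.split(".")[0])


def describe_openai_install(version_string):
    """Say which interface a given openai package version provides."""
    if major_version(version_string) == 0:
        return "pre-1.0 interface: follow the migration steps to use AzureOpenAI"
    return "1.x or later interface: the AzureOpenAI client is available"


def installed_openai_version():
    """Return the installed openai version, or None if it is not installed."""
    try:
        return metadata.version("openai")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_openai_version()
    if version is None:
        print("The openai package is not installed.")
    else:
        print(version, "->", describe_openai_install(version))
```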

Next, you can explore real-time data injection to personalize the responses further. Try to work out how to relay data between your web app, the user's input and output, the search index, and the LLM.
Keywords: LangChain, Databricks
