Text Splitter Customizations
- Updating model name
- Adjusting Chunk Size and Overlap
- Using a Custom Text Splitter
- Build and start the container
Updating the Model Name
The default text splitter is a SentenceTransformersTokenTextSplitter instance.
The text splitter uses a pre-trained model from Hugging Face to identify sentence boundaries.
You can change the model by setting the `APP_TEXTSPLITTER_MODELNAME` environment variable in the `chain-server` service of your `docker-compose.yaml` file, as in the following example:
```yaml
services:
  chain-server:
    environment:
      APP_TEXTSPLITTER_MODELNAME: intfloat/e5-large-v2
```
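Inside the container, the chain server resolves this setting from its environment. As a minimal sketch only (the actual configuration loader in `RAG/src/chain_server/utils.py` is not shown in this guide and may differ), the lookup amounts to reading the variable with a fallback default:

```python
import os

# Illustrative lookup: read the splitter model name from the environment,
# falling back to the example model from this guide when the variable is
# unset. The real chain server uses its own config machinery.
model_name = os.environ.get("APP_TEXTSPLITTER_MODELNAME", "intfloat/e5-large-v2")
print(model_name)
```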
Adjusting Chunk Size and Overlap
The text splitter divides documents into smaller chunks for processing.
You can control the chunk size and overlap by using environment variables in the `chain-server` service of your `docker-compose.yaml` file:
- `APP_TEXTSPLITTER_CHUNKSIZE`: Sets the maximum number of tokens allowed in each chunk.
- `APP_TEXTSPLITTER_CHUNKOVERLAP`: Defines the number of tokens that overlap between consecutive chunks.
```yaml
services:
  chain-server:
    environment:
      APP_TEXTSPLITTER_CHUNKSIZE: 256
      APP_TEXTSPLITTER_CHUNKOVERLAP: 128
```
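To see how these two settings interact, the following self-contained sketch produces overlapping chunks from a token list. It uses a plain Python list rather than the splitter's real subword tokenizer, so it only illustrates the windowing arithmetic, not the actual splitter:

```python
def split_tokens(tokens, chunk_size, chunk_overlap):
    """Yield overlapping chunks: each chunk holds up to chunk_size tokens,
    and consecutive chunks share chunk_overlap tokens."""
    step = chunk_size - chunk_overlap  # how far the window advances each time
    for start in range(0, len(tokens), step):
        yield tokens[start:start + chunk_size]

tokens = [f"t{i}" for i in range(10)]
chunks = list(split_tokens(tokens, chunk_size=4, chunk_overlap=2))
# Each chunk repeats the last 2 tokens of the previous one:
# [t0..t3], [t2..t5], [t4..t7], [t6..t9], [t8, t9]
```

A larger overlap preserves more context across chunk boundaries at the cost of storing and embedding more duplicated tokens.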
Using a Custom Text Splitter
While the default text splitter works well, you can also implement a custom splitter for specific needs.
- Modify the `get_text_splitter` method in `RAG/src/chain_server/utils.py` to incorporate your custom text splitter class.

  ```python
  def get_text_splitter():
      from langchain.text_splitter import RecursiveCharacterTextSplitter

      return RecursiveCharacterTextSplitter(
          chunk_size=get_config().text_splitter.chunk_size - 2,
          chunk_overlap=get_config().text_splitter.chunk_overlap,
      )
  ```
  Make sure that the chunks created by the function contain fewer tokens than the context length of the embedding model.
Build and Start the Container
After you change the `get_text_splitter` method, build and start the container.
- Navigate to the example directory.

  ```shell
  cd RAG/examples/basic_rag/llamaindex
  ```

- Build and deploy the microservice.

  ```shell
  docker compose up -d --build
  ```