🦙 LlamaIndex RAG Cheatsheet

pip install llama-index llama-index-vector-stores-chroma chromadb

⚙️ Settings → 📄 Load → ✂️ Parse → 🗄️ Index → 🔍 Query Engine → 💬 Answer

📦 INSTALL & SETTINGS

terminal

pip install llama-index \
    llama-index-vector-stores-chroma chromadb

settings (replaces ServiceContext)

import os

# Placeholder key; prefer exporting OPENAI_API_KEY in your shell over hardcoding it
os.environ["OPENAI_API_KEY"] = "sk-..."

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Global config singleton
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.2)
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small"
)

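⚠ ServiceContext has been deprecated since v0.10. Use the Settings singleton instead.

⚠ If you skip Settings, LlamaIndex falls back to its built-in OpenAI defaults (gpt-3.5-turbo + text-embedding-ada-002 in v0.10; check your installed version).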
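Hardcoding the key, as in the snippet above, is fine for a scratch notebook but easy to leak. A minimal sketch of failing fast when the key is missing, assuming it is exported in the shell; `require_api_key` is a hypothetical helper, not part of LlamaIndex:

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ,
                    name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored under `name`, or raise with a helpful message."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the app")
    return key


# Usage with an explicit mapping so the example runs without a real key.
demo_env = {"OPENAI_API_KEY": "sk-demo-not-a-real-key"}
print(require_api_key(demo_env)[:7])
```

Call it once at startup, before any Settings assignment, so a missing key surfaces immediately rather than on the first query.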

📄 DOCUMENT READERS

llama_index.core

from llama_index.core import SimpleDirectoryReader

# Single file
docs = SimpleDirectoryReader(
    input_files=["data/manual.txt"]
).load_data()

# Entire directory
docs = SimpleDirectoryReader(
    input_dir="./data",
    required_exts=[".txt", ".pdf"],
    recursive=True,
).load_data()

All Options

SimpleDirectoryReader	Local files, auto-detect type
SimpleWebPageReader	Scrape URLs
WikipediaReader	Wikipedia articles by title
DatabaseReader	SQL databases via query
SlackReader	Slack channels / messages
NotionPageReader	Notion pages by ID
GithubRepositoryReader	Entire GitHub repos
LlamaParse (llama-parse)	Production PDFs with tables / images
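To make the input_dir / required_exts / recursive options concrete, here is a rough stdlib sketch of the file selection they describe. It is an illustration only, not SimpleDirectoryReader's code: `find_input_files` is a hypothetical helper, and details such as hidden-file handling are left out.

```python
from pathlib import Path
import tempfile


def find_input_files(input_dir, required_exts, recursive=True):
    """Return sorted files under input_dir whose suffix is in required_exts."""
    root = Path(input_dir)
    walker = root.rglob("*") if recursive else root.glob("*")
    wanted = {ext.lower() for ext in required_exts}
    return sorted(p for p in walker if p.is_file() and p.suffix.lower() in wanted)


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    (base / "manual.txt").write_text("hello")
    (base / "sub" / "spec.pdf").write_text("fake pdf")
    (base / "notes.md").write_text("skipped")
    found = [p.relative_to(base).as_posix()
             for p in find_input_files(base, [".txt", ".pdf"])]
    flat = [p.name for p in find_input_files(base, [".txt", ".pdf"], recursive=False)]

print(found)  # files matched with recursive=True
print(flat)   # files matched with recursive=False
```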

✂️ NODE PARSERS (Splitters)

llama_index.core.node_parser

from llama_index.core.node_parser import SentenceSplitter

Settings.text_splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50,
)

# Or apply directly
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(docs)

All Options

SentenceSplitter	DEFAULT: sentence-aware boundaries
TokenTextSplitter	Exact token count (tiktoken)
SentenceWindowNodeParser	1 sentence + surrounding window in metadata
HierarchicalNodeParser	Parent-child node tree (doc -> section -> chunk)
SemanticSplitterNodeParser	Embedding-based boundaries (expensive)
MarkdownNodeParser	Split by Markdown headings
HTMLNodeParser	Split by HTML tag structure
CodeSplitter	Split by function / class in source code

💡 LlamaIndex default: chunk_size=1024 tokens. Most RAG apps use 256-1024.
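How chunk_size and chunk_overlap interact is easiest to see on a toy sliding window. The sketch below is a simplified illustration over a plain token list; `chunk_with_overlap` is a hypothetical helper, and the real SentenceSplitter additionally tries to respect sentence boundaries.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size sharing chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


words = [f"w{i}" for i in range(10)]
for chunk in chunk_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts chunk_size - chunk_overlap tokens after the previous one, so a larger overlap means more chunks (and more embedding calls) for the same text.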

🧮 EMBEDDINGS

llama_index.embeddings.openai

from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    # dimensions=512,       # optional reduction
    # embed_batch_size=20,  # batch API calls
)

Model	Dims	$/1M tok
text-embedding-3-small	1536	$0.02
text-embedding-3-large	3072	$0.13
text-embedding-ada-002	1536	$0.10 (legacy)
HuggingFace (local)	varies	Free
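A quick way to turn the price column into a budget figure. The prices are copied from the table above and may be out of date; `embedding_cost` is a hypothetical helper, and the token count is an assumed example workload.

```python
# Prices copied from the table above (USD per 1M tokens); they may have changed.
PRICE_PER_MILLION = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "text-embedding-ada-002": 0.10,
}


def embedding_cost(total_tokens, model):
    """Estimated USD cost of embedding total_tokens once with the given model."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION[model]


# Example workload: 2,000 chunks of roughly 512 tokens each.
tokens = 2_000 * 512
for name in PRICE_PER_MILLION:
    print(f"{name}: ${embedding_cost(tokens, name):.4f}")
```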

🗄️ INDEX + VECTOR STORE

indexing pipeline

import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("my_col")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_ctx = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_ctx
)

Index Types

VectorStoreIndex	DEFAULT: similarity search
SummaryIndex	Summarise all docs (no retrieval)
TreeIndex	Hierarchical summarisation tree
KeywordTableIndex	Keyword-based lookup (BM25-like)
KnowledgeGraphIndex	Entity-relation graph
DocumentSummaryIndex	Per-doc summaries, then drill into chunks

Vector Store Backends

Default (in-memory)	Quick prototyping, no deps
ChromaVectorStore	Local, persistent, easy
QdrantVectorStore	Cloud / self-hosted, production
PineconeVectorStore	Managed cloud, auto-scaling
FaissVectorStore	Ultra-fast, large datasets
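Whatever the backend, similarity search comes down to scoring stored vectors against the query vector and keeping the best k. A brute-force sketch with made-up 3-dimensional vectors follows; `cosine` and `top_k` are hypothetical helpers, and production stores use approximate indexes rather than a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, stored, k):
    """Return the k (score, text) pairs most similar to query_vec, best first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
stored = [
    ("Zigbee range", [0.9, 0.1, 0.0]),
    ("Wi-Fi setup", [0.1, 0.9, 0.0]),
    ("Warranty terms", [0.0, 0.1, 0.9]),
]
for score, text in top_k([1.0, 0.2, 0.0], stored, k=2):
    print(f"{score:.3f}  {text}")
```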

🔍 QUERY ENGINE

query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact",
)

response = query_engine.query("my question")
print(response.response)      # answer text
print(response.source_nodes)  # retrieved chunks

Response Modes

compact	DEFAULT: packs chunks into as few prompts as fit, then refines across them
refine	Process chunks one by one, refining the answer
tree_summarize	Hierarchical tree summarisation
simple_summarize	Truncate chunks to fit one prompt, summarise in one shot
no_text	Return nodes only, no LLM call
accumulate	Separate answer per chunk, concatenated
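The practical difference between refine and compact is the number of LLM calls. The sketch below illustrates the packing idea only, using character counts as a stand-in for tokens; `pack_chunks` is a hypothetical helper and not how LlamaIndex implements the mode.

```python
def pack_chunks(chunks, max_chars):
    """Greedily merge consecutive chunks into prompts of at most max_chars.

    A single chunk longer than max_chars is kept whole in its own prompt.
    """
    packed, current = [], ""
    for chunk in chunks:
        candidate = f"{current}\n\n{chunk}" if current else chunk
        if current and len(candidate) > max_chars:
            packed.append(current)  # current is full; start a new prompt
            current = chunk
        else:
            current = candidate
    if current:
        packed.append(current)
    return packed


chunks = ["a" * 40, "b" * 40, "c" * 40, "d" * 40]
print(len(chunks), "LLM calls if each chunk is processed separately")
print(len(pack_chunks(chunks, max_chars=100)), "LLM calls after packing")
```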

Engine Types

as_query_engine()	One-shot Q&A (default)
as_chat_engine()	Conversational, maintains history
as_retriever()	Retrieval only, no LLM generation

💬 CHAT ENGINE

chat_engine = index.as_chat_engine(
    chat_mode="condense_question"
)

# Multi-turn conversation
r1 = chat_engine.chat("What protocols?")
r2 = chat_engine.chat("Which has longest range?")
# ^ auto-rewrites: "Which of Wi-Fi, Zigbee,
#   Z-Wave, BLE has the longest range?"

Chat Modes

condense_question	Rewrites follow-ups using history
context	Fresh retrieval every message
condense_plus_context	Rewrite + fresh retrieval
react	ReAct agent with tool-use
best	Auto-selects best mode (default)
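The condense_question idea can be sketched as a prompt that asks an LLM to fold the chat history into the follow-up. The wording below is an assumption for illustration, not LlamaIndex's built-in condense prompt; `build_condense_prompt` is a hypothetical helper and the assistant turn is made-up demo history.

```python
def build_condense_prompt(history, follow_up):
    """Build a prompt asking an LLM to rewrite follow_up as a standalone question."""
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "Given the conversation below, rewrite the follow-up message "
        "as a standalone question.\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Follow-up: {follow_up}\n"
        "Standalone question:"
    )


# Made-up history matching the multi-turn example above.
history = [
    ("user", "What protocols?"),
    ("assistant", "Wi-Fi, Zigbee, Z-Wave and BLE."),
]
condense_prompt = build_condense_prompt(history, "Which has longest range?")
print(condense_prompt)
```

The LLM's reply to this prompt, rather than the raw follow-up, is what gets sent to retrieval.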

📝 CUSTOM PROMPTS

llama_index.core.prompts

from llama_index.core.prompts import PromptTemplate

qa_prompt = PromptTemplate(
    """Answer from context only.
If unsure, say "I don't know."

Context:
-----
{context_str}
-----
Question: {query_str}
Answer:"""
)

query_engine = index.as_query_engine(
    text_qa_template=qa_prompt,
    similarity_top_k=5,
)

💡 Variables: {context_str} = retrieved text, {query_str} = user question
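To see what the LLM actually receives, the two variables can be filled by hand with plain string formatting. This is a stdlib sketch of the substitution, assuming chunks are joined with blank lines; `fill_prompt` is a hypothetical helper and the chunk texts are placeholders.

```python
QA_TEMPLATE = (
    "Answer from context only.\n"
    'If unsure, say "I don\'t know."\n'
    "Context:\n-----\n{context_str}\n-----\n"
    "Question: {query_str}\n"
    "Answer:"
)


def fill_prompt(template, chunks, question):
    """Join retrieved chunks into context_str and substitute both variables."""
    context_str = "\n\n".join(chunks)
    return template.format(context_str=context_str, query_str=question)


# Placeholder chunk texts stand in for retrieved nodes.
prompt = fill_prompt(QA_TEMPLATE, ["first chunk", "second chunk"], "my question")
print(prompt)
```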

⚡ QUICK REFERENCE

Reload Without Re-indexing

index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store
)
query_engine = index.as_query_engine()

Override Settings Per-call

# Different LLM for one query
engine = index.as_query_engine(
    llm=OpenAI(model="gpt-4o")
)

# Different embeddings for indexing
index = VectorStoreIndex.from_documents(
    docs,
    embed_model=OpenAIEmbedding(
        model="text-embedding-3-large"
    ),
)

Source Attribution

response = query_engine.query("question")

for node in response.source_nodes:
    print(f"Score: {node.score:.4f}")
    print(f"Text: {node.text[:100]}")
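The loop above assumes every node has a numeric score; a score can also be missing, in which case the `:.4f` format fails. A defensive sketch using a stand-in dataclass follows; `ScoredChunk` and `format_sources` are hypothetical names and the 0.5 threshold is an arbitrary example value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredChunk:
    """Hypothetical stand-in for a retrieved node: text plus optional score."""
    text: str
    score: Optional[float] = None


def format_sources(chunks, min_score=0.0, preview=100):
    """Return one 'Score | Text' line per chunk; drop chunks scoring below min_score."""
    lines = []
    for chunk in chunks:
        if chunk.score is None:
            label = "n/a"  # keep unscored chunks, but do not format None as a float
        elif chunk.score < min_score:
            continue
        else:
            label = f"{chunk.score:.4f}"
        lines.append(f"Score: {label} | Text: {chunk.text[:preview]}")
    return lines


demo = [ScoredChunk("alpha " * 30, 0.82), ScoredChunk("beta", 0.31), ScoredChunk("gamma")]
for line in format_sources(demo, min_score=0.5):
    print(line)
```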

LlamaIndex v0.12+ · Settings API · llama-index-llms-openai · llama-index-vector-stores-chroma · 2026