🦙 LlamaIndex RAG Cheatsheet

pip install llama-index llama-index-vector-stores-chroma chromadb

⚙️ Settings → 📄 Load → ✂️ Parse → 🗄️ Index → 🔍 Query Engine → 💬 Answer

📦 INSTALL & SETTINGS

terminal
pip install llama-index \
    llama-index-vector-stores-chroma chromadb

settings (replaces ServiceContext)
import os
os.environ["OPENAI_API_KEY"] = "sk-..."

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Global config singleton
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.2)
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small"
)
⚠ ServiceContext is deprecated since v0.10. Use the Settings singleton.
⚠ If you skip Settings, LlamaIndex falls back to its OpenAI defaults (gpt-3.5-turbo + text-embedding-ada-002 in v0.10; check your installed version).

📄 DOCUMENT READERS

llama_index.core
from llama_index.core import SimpleDirectoryReader

# Single file
docs = SimpleDirectoryReader(
    input_files=["data/manual.txt"]
).load_data()

# Entire directory
docs = SimpleDirectoryReader(
    input_dir="./data",
    required_exts=[".txt", ".pdf"],
    recursive=True
).load_data()

All Options

SimpleDirectoryReader     Local files, auto-detect type
SimpleWebPageReader       Scrape URLs
WikipediaReader           Wikipedia articles by title
DatabaseReader            SQL databases via query
SlackReader               Slack channels / messages
NotionPageReader          Notion pages by ID
GithubRepositoryReader    Entire GitHub repos
LlamaParse (llama-parse)  Production PDFs with tables / images

✂️ NODE PARSERS (Splitters)

llama_index.core.node_parser
from llama_index.core.node_parser import SentenceSplitter

Settings.text_splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50
)

# Or apply directly
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(docs)

All Options

SentenceSplitter            DEFAULT: sentence-aware boundaries
TokenTextSplitter           Exact token count (tiktoken)
SentenceWindowNodeParser    1 sentence + surrounding window in metadata
HierarchicalNodeParser      Parent-child node tree (doc -> section -> chunk)
SemanticSplitterNodeParser  Embedding-based boundaries (expensive)
MarkdownNodeParser          Split by Markdown headings
HTMLNodeParser              Split by HTML tag structure
CodeSplitter                Split by function / class in source code
💡 LlamaIndex default: chunk_size=1024 (tokens). Most RAG apps use 256-1024.
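
How chunk_size and chunk_overlap interact can be sketched with a plain sliding window. This is an illustration only: it counts whitespace-separated words rather than tokens, ignores sentence boundaries, and is not how SentenceSplitter is implemented. The helper name chunk_words is hypothetical.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size words.

    Each window repeats the last chunk_overlap words of the previous one.
    Assumption: words stand in for tokens to keep the sketch dependency-free.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

A larger overlap keeps more shared context between neighbouring chunks, at the price of more chunks (and more tokens to embed) for the same text.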

🧮 EMBEDDINGS

llama_index.embeddings.openai
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    # dimensions=512,       # optional reduction
    # embed_batch_size=20,  # batch API calls
)

Model                   Dims    $/1M tok
text-embedding-3-small  1536    $0.02
text-embedding-3-large  3072    $0.13
text-embedding-ada-002  1536    $0.10 (legacy)
HuggingFace (local)     varies  Free
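
The per-million-token prices above translate into an indexing cost with simple arithmetic. The sketch below assumes the prices in the table are current and uses a made-up corpus size; the function name embedding_cost_usd is hypothetical.

```python
# Prices in USD per 1M tokens, copied from the table above (may change over time).
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "text-embedding-ada-002": 0.10,
}


def embedding_cost_usd(total_tokens: int, model: str) -> float:
    """Estimate the one-off cost of embedding total_tokens with the given model."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


# Hypothetical corpus: 2,000 chunks of 512 tokens each.
corpus_tokens = 2_000 * 512
for model_name in PRICE_PER_MILLION_TOKENS:
    cost = embedding_cost_usd(corpus_tokens, model_name)
    print(f"{model_name}: ${cost:.4f}")
```

Chunk overlap inflates the token count, since overlapping text is embedded more than once.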

🗄️ INDEX + VECTOR STORE

indexing pipeline
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("my_col")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_ctx = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_ctx
)

Index Types

VectorStoreIndex      DEFAULT: similarity search
SummaryIndex          Summarise all docs (no retrieval)
TreeIndex             Hierarchical summarisation tree
KeywordTableIndex     Keyword-based lookup (BM25-like)
KnowledgeGraphIndex   Entity-relation graph
DocumentSummaryIndex  Per-doc summaries, then drill into chunks

Vector Store Backends

Default (in-memory)  Quick prototyping, no deps
ChromaVectorStore    Local, persistent, easy
QdrantVectorStore    Cloud / self-hosted, production
PineconeVectorStore  Managed cloud, auto-scaling
FaissVectorStore     Ultra-fast, large datasets

🔍 QUERY ENGINE

query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact"
)

response = query_engine.query("my question")
print(response.response)      # answer text
print(response.source_nodes)  # retrieved chunks
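
What similarity_top_k controls can be sketched without any vector store: score every chunk embedding against the query embedding and keep the k best. This sketch assumes cosine similarity and uses hand-made 3-dimensional toy vectors; real backends may use a different metric and approximate search, and the names cosine_similarity and top_k are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int):
    """Return the k (chunk_id, score) pairs with the highest similarity."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings, invented for illustration only.
toy_chunks = {
    "wifi": [1.0, 0.0, 0.0],
    "zigbee": [0.8, 0.6, 0.0],
    "recipes": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.1, 0.0], toy_chunks, k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

A higher k gives the LLM more context to work with but also more tokens per query and more room for irrelevant chunks.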

Response Modes

compact           DEFAULT: packs chunks into as few prompts as fit, then refines
refine            Process chunks one-by-one, refine answer
tree_summarize    Hierarchical tree summarisation
simple_summarize  Truncate all chunks to fit one prompt, summarise in one shot
no_text           Return nodes only, no LLM call
accumulate        Separate answer per chunk, concatenate
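
The practical difference between refine and compact is the number of LLM calls. A rough sketch of the packing idea, assuming a character budget instead of a token budget and a hypothetical helper pack_chunks (this is not LlamaIndex's own packing code):

```python
def pack_chunks(chunks: list[str], max_chars: int) -> list[str]:
    """Greedily merge consecutive chunks into as few prompts as fit max_chars.

    A single chunk longer than max_chars is kept as its own prompt unchanged.
    """
    packed = []
    current = ""
    for chunk in chunks:
        candidate = chunk if not current else current + "\n\n" + chunk
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            packed.append(current)  # budget exceeded: close the current prompt
            current = chunk
    if current:
        packed.append(current)
    return packed


retrieved = ["a" * 40, "b" * 40, "c" * 40, "d" * 40]
refine_calls = len(retrieved)                               # one call per chunk
compact_calls = len(pack_chunks(retrieved, max_chars=100))  # one call per packed prompt
print(f"refine: {refine_calls} calls, compact: {compact_calls} calls")
```

Fewer calls usually means lower latency and cost, which is one reason to start with compact and switch modes only when answers need it.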

Engine Types

as_query_engine()  One-shot Q&A (default)
as_chat_engine()   Conversational, maintains history
as_retriever()     Retrieval only, no LLM generation

💬 CHAT ENGINE

chat_engine = index.as_chat_engine(
    chat_mode="condense_question"
)

# Multi-turn conversation
r1 = chat_engine.chat("What protocols?")
r2 = chat_engine.chat("Which has longest range?")
# ^ auto-rewrites: "Which of Wi-Fi, Zigbee,
#   Z-Wave, BLE has the longest range?"

Chat Modes

condense_question      Rewrites follow-ups using history
context                Fresh retrieval every message
condense_plus_context  Rewrite + fresh retrieval
react                  ReAct agent with tool-use
best                   Auto-selects best mode (default)
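
The rewriting step in condense_question can be pictured as one extra LLM prompt built from the chat history. The sketch below only builds that prompt; the template wording, the sample assistant reply, and the helper name build_condense_prompt are all hypothetical and do not reproduce LlamaIndex's built-in prompt.

```python
# Hypothetical template; the library ships its own wording.
CONDENSE_TEMPLATE = (
    "Given the conversation so far and a follow-up message, rewrite the "
    "follow-up as a standalone question.\n\n"
    "Chat history:\n{chat_history}\n\n"
    "Follow-up: {question}\n"
    "Standalone question:"
)


def build_condense_prompt(history: list[tuple[str, str]], question: str) -> str:
    """Render (role, text) history pairs plus the new question into one prompt."""
    lines = [f"{role}: {text}" for role, text in history]
    return CONDENSE_TEMPLATE.format(chat_history="\n".join(lines), question=question)


history = [
    ("user", "What protocols?"),
    ("assistant", "Wi-Fi, Zigbee, Z-Wave, and BLE."),  # invented reply
]
condense_prompt = build_condense_prompt(history, "Which has longest range?")
print(condense_prompt)
```

The LLM's reply to this prompt, not the raw follow-up, is what gets sent to retrieval, which is why vague follow-ups still find the right chunks.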

📝 CUSTOM PROMPTS

llama_index.core.prompts
from llama_index.core.prompts import PromptTemplate

qa_prompt = PromptTemplate(
    """Answer from context only.
If unsure, say "I don't know."

Context:
-----
{context_str}
-----

Question: {query_str}
Answer:"""
)

query_engine = index.as_query_engine(
    text_qa_template=qa_prompt,
    similarity_top_k=5
)
💡 Variables: {context_str} = retrieved text, {query_str} = user question
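
To see what the LLM actually receives, the two variables can be filled in by hand with plain str.format. This is a dependency-free approximation of the substitution, not the PromptTemplate class itself; the sample chunks and the helper name build_prompt are made up.

```python
QA_TEMPLATE = (
    "Answer from context only.\n"
    "If unsure, say \"I don't know.\"\n"
    "\n"
    "Context:\n"
    "-----\n"
    "{context_str}\n"
    "-----\n"
    "\n"
    "Question: {query_str}\n"
    "Answer:"
)


def build_prompt(retrieved_chunks: list[str], question: str) -> str:
    """Join the retrieved chunks into context_str and substitute both variables."""
    context_str = "\n\n".join(retrieved_chunks)
    return QA_TEMPLATE.format(context_str=context_str, query_str=question)


# Invented chunks standing in for retrieved text.
sample_chunks = [
    "Zigbee is a low-power mesh protocol.",
    "Wi-Fi offers high bandwidth at higher power draw.",
]
final_prompt = build_prompt(sample_chunks, "Which protocol is low-power?")
print(final_prompt)
```

Printing the rendered prompt like this is a quick way to check that instructions, context, and question appear in the order you intended.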

⚡ QUICK REFERENCE

Reload Without Re-indexing

index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store
)
query_engine = index.as_query_engine()

Override Settings Per-call

# Different LLM for one query
engine = index.as_query_engine(
    llm=OpenAI(model="gpt-4o")
)

# Different embeddings for indexing
# (query with the same embed_model, or scores are meaningless)
index = VectorStoreIndex.from_documents(
    docs,
    embed_model=OpenAIEmbedding(
        model="text-embedding-3-large"
    )
)

Source Attribution

response = query_engine.query("question")
for node in response.source_nodes:
    print(f"Score: {node.score:.4f}")
    print(f"Text: {node.text[:100]}")

LlamaIndex v0.12+ · Settings API · llama-index-llms-openai · llama-index-vector-stores-chroma · 2026