Let’s make LLMs generate JSON!
Miloš Švaňa
3.5.2024
Update: In the original version of this article, available prior to June 5, 2024, we mentioned that we had difficulties generating JSON with Guidance. As it turns out, there was a bug in our code. On June 5, 2024, we updated the article to also include results for Guidance, according to which Guidance is actually the fastest of the three solutions we tried.
When we try to integrate large language models into our apps, we often struggle with the unpredictability of their output. We can’t be sure about what the answer to a request is going to look like, so it is very difficult to write code for processing this answer.
That is, until we somehow force the LLM to produce answers predictably, say as a JSON object that follows a predefined schema. We could try some prompt engineering magic, but the output format we specify is still not guaranteed. Isn’t there a more deterministic way of solving this problem? OpenAI has a solution that works reasonably well: function calling. (It also has a JSON mode, but to be honest, I haven’t used it to this day.)
But not everyone can or wants to use OpenAI’s GPT models or other solutions that require sending data to someone else’s computer. If you are one of those people, open-weight local LLMs might be your go-to alternative. How can we solve the issue of output predictability if we are using one of these models?
In this article, we are going to talk about three tools that can, at least in theory, force any local LLM to produce structured JSON output: LM Format Enforcer, Outlines, and Guidance. We will evaluate their performance on a few test cases ranging from book recommendations to extracting information from HTML. Then we will show you how forcing LLMs to produce a structured output can be used to solve a very common problem in many businesses: extracting structured records from free-form text.
Available tools
LM Format Enforcer
https://github.com/noamgat/lm-format-enforcer
LM Format Enforcer works by combining a character-level parser with a tokenizer prefix tree. The character-level parser limits the set of characters that can be added to the output in the next step based on the constraints we set. For example, when generating JSON, the opening bracket { must be followed by whitespace, a closing bracket, or a quotation mark that starts a property name. The tokenizer prefix tree (also called a trie) is built from the language model's tokenizer. The library combines both data structures to filter the tokens that the language model is allowed to generate at each step.
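To build some intuition, here is a rough conceptual sketch of that filtering step. The method names are illustrative, not the library's exact API, and the real implementation walks the tokenizer trie instead of scanning the whole vocabulary token by token:

def filter_allowed_tokens(parser, vocabulary: dict[int, str]) -> list[int]:
    """Keep only the tokens whose characters the parser accepts one by one."""
    allowed = []
    for token_id, token_text in vocabulary.items():
        state = parser
        ok = True
        for character in token_text:
            if character not in state.get_allowed_characters():
                ok = False
                break
            state = state.add_character(character)
        if ok:
            allowed.append(token_id)
    return allowed

At every generation step, the logits of all tokens outside this allowed set are masked out, so the model can only ever produce output that the parser accepts.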
The library supports two types of generation constraints: JSON schemas and regular expressions. You can use it in combination with many LLM inference libraries, including llama.cpp:
from pydantic import BaseModel

import llama_cpp
import lmformatenforcer
from lmformatenforcer.integrations.llamacpp import (
    build_llamacpp_logits_processor,
    build_token_enforcer_tokenizer_data,
)

class Book(BaseModel):
    author: str
    title: str

class BookList(BaseModel):
    books: list[Book]

DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\
"""

def get_prompt(message: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    """Formats the prompt so that it follows a specified chat format"""
    return f'[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{message} [/INST]'

# some_llamacpp_model is a llama_cpp.Llama instance loaded beforehand
tokenizer_data = build_token_enforcer_tokenizer_data(some_llamacpp_model)

def llamacpp_with_character_level_parser(prompt: str, character_level_parser) -> str:
    """Runs inference with a logits processor built from the given parser"""
    logits_processors: llama_cpp.LogitsProcessorList | None = None
    if character_level_parser:
        logits_processors = llama_cpp.LogitsProcessorList(
            [build_llamacpp_logits_processor(tokenizer_data, character_level_parser)])
    output = some_llamacpp_model(prompt, logits_processor=logits_processors, max_tokens=768)
    text: str = output['choices'][0]['text']
    return text

prompt_base = f"Recommend me 3 books on climate change in the following JSON schema: {BookList.schema_json()}:\n"
prompt_processed = get_prompt(prompt_base)
parser = lmformatenforcer.JsonSchemaParser(BookList.schema())
result = llamacpp_with_character_level_parser(prompt_processed, parser)
print(result)
Outlines
https://github.com/outlines-dev/outlines
Outlines constrains the LLM output by using a finite state machine. The mechanism is explained in great detail in their arXiv paper. One of its advantages is that you can use it not only with open-weight local LLMs, but also with OpenAI’s GPT models. It also provides additional options for constraining the output: you can force the model to choose from a predefined list of strings or to generate a value of a specific data type, such as an integer (we sketch these options below, after the JSON example).
Here is how you can use Outlines to generate JSON:
import json
import outlines

model = outlines.models.LlamaCpp(some_llamacpp_model)  # wrap an existing llama_cpp.Llama instance
schema = json.dumps(BookList.schema())  # pass the schema as a JSON string
generator = outlines.generate.json(model, schema)
result = generator(prompt_base)
print(result)
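Outlines is not limited to JSON. Reusing the model wrapper from the snippet above, choosing from a fixed set of options or forcing an integer might look roughly like this (the helper names follow the Outlines API as of the versions we used; they may differ in newer releases):

# Force the model to pick one of a few predefined strings
choice_generator = outlines.generate.choice(model, ["positive", "negative", "neutral"])
sentiment = choice_generator("Classify the sentiment of: 'A great book on climate change!'\nAnswer: ")

# Force the model to produce a value of a specific data type
int_generator = outlines.generate.format(model, int)
year = int_generator("In which year was the IPCC founded?\nAnswer: ")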
You can see that, compared to LM Format Enforcer, Outlines adds more abstraction. This might make it harder to customize the model's behavior, but at the same time, this higher level of abstraction might be exactly what enables Outlines to work with non-local models and with models built on top of libraries other than transformers, such as llama.cpp.
Guidance
https://github.com/guidance-ai/guidance
Guidance is by far the most popular of the three options listed here, at least if we look at the number of GitHub stars. It also seems to be the most flexible, as it gives you a lot of control over the LLM output. Its API is also unique in its similarity to string concatenation. This approach has a lot of advantages: you can build the output part by part by combining string literals with constrained or free generation. When it comes to defining constraints, Guidance gives you a plethora of options: regular expressions, data types, selecting from multiple choices, or even calling your own functions (we sketch some of these below, after the JSON example). It integrates well with local LLMs, as well as with models from OpenAI or Google's Vertex AI, including multimodal models with image-processing capabilities.
Here is how we can use Guidance with a llama.cpp model to generate JSON:
import guidance

guidance_model = guidance.models.LlamaCpp(some_llamacpp_model)
with guidance.user():
    lm = guidance_model + prompt_base  # the user()/assistant() blocks apply the chat template for us
with guidance.assistant():
    lm += guidance.json(schema=BookList.schema(), name="answer")
print(lm["answer"])
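JSON is not the only constraint Guidance supports. Sticking with the same guidance_model, forcing a choice from a fixed set of options or a regex-constrained value might look roughly like this (an illustrative sketch based on the Guidance API at the time of writing, not a recipe we benchmarked):

from guidance import gen, select

with guidance.user():
    lm = guidance_model + "Is Mistral an open-weight model, and roughly how many billion parameters does it have?"
with guidance.assistant():
    lm += "Open weight: " + select(["yes", "no"], name="open_weight")
    lm += "\nParameters (billions): " + gen(regex=r"\d+", name="params")
print(lm["open_weight"], lm["params"])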
Evaluation
To evaluate and compare the three libraries, we’ve prepared a small set of prompts that need to be answered by producing a JSON string according to the specified schema. Each prompt represents a different use case:
Prompt 1 – Use your own knowledge about the world to generate the answer:
Recommend me 3 books on climate change in the following JSON schema: {SCHEMA}:
Prompt 2 – What happens when there is no information about the schema?:
Recommend me 3 books on climate change. Answer:
Prompt 3 – Extract information from a semi-structured document:
Describe the following HTML: ----- HTML START ----- {HTML OF THE HACKER NEWS LOGIN PAGE} ----- HTML END ----- Here is an answer in the following JSON schema: {SCHEMA}:
Prompt 4 – Use user-provided data to guide creative generation:
Here are some of my notes: {NOTES ABOUT SYSTEMS THINKING IN MARKDOWN} Use these notes to generate 3 project ideas in the following JSON format: {SCHEMA}:
Prompt 5 – Extract numeric information from text:
It is March 15, 2024. Apple's stock price behaved like this yesterday: - open price: 100.0 - close price: 103.1 - high price: 104.2 - low price: 99.8 Create a price record in the following JSON format: {SCHEMA}:
We used Pydantic to define the output schemas. For example, here are the classes representing the schema for the first two prompts:
class Book(BaseModel):
    author: str
    title: str

class BookList(BaseModel):
    books: list[Book]
To generate the JSON schema, we simply call:
schema = BookList.schema()
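For the BookList class above, the generated schema is a plain Python dictionary that looks roughly like this (the exact layout depends on the pydantic version):

{
    "title": "BookList",
    "type": "object",
    "properties": {
        "books": {"title": "Books", "type": "array", "items": {"$ref": "#/definitions/Book"}}
    },
    "required": ["books"],
    "definitions": {
        "Book": {
            "title": "Book",
            "type": "object",
            "properties": {
                "author": {"title": "Author", "type": "string"},
                "title": {"title": "Title", "type": "string"}
            },
            "required": ["author", "title"]
        }
    }
}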
We also wanted to see how model choice influences the quality of the results. We decided to try two popular open-source models: Mistral and Falcon-7b. In both cases we used a quantized version of the model and llama.cpp as our inference backend.
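Loading a quantized model with llama-cpp-python might look roughly like the snippet below. The repository and file names are placeholders for a quantized Mistral build, not necessarily the exact files we used:

import llama_cpp

some_llamacpp_model = llama_cpp.Llama.from_pretrained(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",  # placeholder repository
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",   # placeholder quantized weights
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # offload all layers to the GPU
)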
But what are we actually evaluating? We are interested in two measures:
- Output correctness: Do we get a valid JSON object and does it contain correct data?
- Time: How long does it take to generate the output? Is the generation time stable?
We ran all tests in a Kaggle notebook on two NVIDIA T4s with 16GB of VRAM each. In terms of software, we used Python 3.10 and the most recent versions of each library available on PyPI as of June 2024 with the exception of llama-cpp-python. Because of a few compatibility issues, we had to downgrade to version 0.2.75.
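For illustration, a single measurement could be wrapped in a helper like the one below. This is a hypothetical sketch rather than our exact benchmarking code; it simply times the generation and checks whether the output validates against the pydantic schema:

import json
import time
import pydantic

def timed_run(generate_fn, prompt: str, schema_model: type) -> tuple[float, bool]:
    """Measure generation time and validate the output against the schema."""
    start = time.perf_counter()
    raw = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    try:
        data = json.loads(raw) if isinstance(raw, str) else raw
        schema_model(**data)  # raises if fields are missing or have the wrong type
        valid = True
    except (json.JSONDecodeError, pydantic.ValidationError, TypeError):
        valid = False
    return elapsed, valid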
Results
So here are the results:
Mistral:
| Library / run | Book | Book no schema | HTML | Ideas | Stock data |
|---|---|---|---|---|---|
| lm-format-enforcer | 5.93 | n/a | 30.73 | 14.57 | 3.43 |
| run 1 | 6.03 | n/a | 20.04 | 15.20 | 3.41 |
| run 2 | 7.03 | n/a | 40.09 | 13.90 | 3.44 |
| run 3 | 4.72 | n/a | 32.06 | 14.60 | 3.45 |
| outlines | 3.51 | 3.85 | 7.34 | 15.12 | 2.98 |
| run 1 | 3.03 | 3.21 | 7.44 | 11.20 | 3.04 |
| run 2 | 3.57 | 3.79 | 5.60 | 20.07 | 2.99 |
| run 3 | 3.94 | 4.56 | 8.98 | 14.10 | 2.91 |
| guidance | 2.38 | n/a | 4.85 | 15.03 | 2.39 |
| run 1 | 2.38 | n/a | 4.79 | 15.20 | 2.38 |
| run 2 | 2.36 | n/a | 5.01 | 14.80 | 2.37 |
| run 3 | 2.40 | n/a | 4.74 | 15.10 | 2.43 |
Falcon:
| Library / run | Book | Book no schema | HTML | Ideas | Stock data |
|---|---|---|---|---|---|
| lm-format-enforcer | 0.62 | n/a | n/a | 5.83 | 4.03 |
| run 1 | n/a | n/a | n/a | 5.58 | n/a |
| run 2 | 3.85 | n/a | 5.55 | 4.61 | 3.97 |
| run 3 | n/a | n/a | 5.60 | 7.29 | 4.09 |
| outlines | 1.99 | 5.35 | 2.71 | 3.20 | 3.08 |
| run 1 | 2.76 | 7.25 | 2.65 | 3.44 | 3.20 |
| run 2 | 0.90 | 1.40 | 3.08 | 2.39 | 3.04 |
| run 3 | 2.32 | 7.39 | 2.40 | 3.76 | 3.01 |
| guidance | 1.48 | n/a | n/a | n/a | 1.87 |
| run 1 | 1.56 | n/a | n/a | n/a | 1.90 |
| run 2 | 1.47 | n/a | n/a | n/a | 1.81 |
| run 3 | 1.42 | n/a | n/a | n/a | 1.90 |
The tables show the time of each run in seconds, as well as the average time for each library and prompt. The background of each cell carries information about correctness: white means the output was correct, yellow means there were small errors, orange means the output was parsable but the values in the JSON object were not correct, and red means the output wasn't a valid JSON object.
Right off the bat, we can see several interesting things. First, Mistral seems to be doing much better than Falcon, which has a lot of orange and red cells. Many of Falcon's generated outputs resemble a documentation example. For instance, here is one of its outputs for Prompt 4:
{
    'ideas': [
        {
            'title': 'Idea 1',
            'description': 'Idea 1 description'
        },
        {
            'title': 'Idea 2',
            'description': 'Idea 2 description'
        },
        {
            'title': 'Idea 3',
            'description': 'Idea 3 description'
        }
    ]
}
Yes, technically correct, but not exactly what we wanted.
What about the differences between LM Format Enforcer, Outlines, and Guidance? The results favor Guidance in terms of speed, but Outlines is the only library that was capable of producing a correct answer when information about the desired JSON schema was not present in the prompt. Both LM Format Enforcer and Guidance usually stopped generating after a few tokens:
{"books
One small issue we encountered when examining Outlines has to do with the date in Prompt 5. The prompt says that the presented data is from yesterday, so ideally we would like to see March 14 rather than March 15 in the generated JSON object. In 2 out of 3 runs we got the incorrect date. However, this issue is more likely related to the imperfections of the language model than to Outlines. Outlines was also the only library that consistently generated valid JSON when combined with Falcon, although when testing prompts 1 and 2 we saw a lot of hallucinated book titles.
What conclusions can we draw? First, model choice matters. As we have seen, Mistral performs much better than Falcon. The scope of our experiments was somewhat limited, though, and we are curious how the libraries would behave with a more powerful model, say Mixtral. Second, if we were to choose a library for constrained generation for our next project, we would go with Outlines or Guidance. In most circumstances we would likely prefer Guidance: it is faster, its output correctness is high, and having to include the desired JSON schema in the prompt is not a big deal.
Extracting structured data from text
We’ve talked about forcing language models to produce JSON according to a provided schema. But what is it good for? There are, of course, many use cases; function calling is one of the most well-known. Here, we want to talk about a different application: extracting structured data from text. This feature can be useful in many situations.
The situation we picked is extracting information from listings on Craigslist. Let’s take laptops, for example. We could extract information about a laptop’s RAM, CPU manufacturer, screen size, or drive capacity. This information could then be used to create filters or to analyze the listings. Let’s have a look at how we can turn the text of a listing into JSON with Mistral and Guidance.
Let’s start by defining a listing schema with Pydantic:
class Laptop(pydantic.BaseModel):
    cpu_manufacturer: typing.Literal["apple", "intel", "amd"] = pydantic.Field(
        description="Manufacturer of the CPU (do not confuse with the manufacturer of the laptop)")
    cpu_model: str = pydantic.Field(description="Model of the CPU, for example M3 or i7-8550U")
    ram_size_gb: int = pydantic.Field(description="Size of the laptop's RAM in GB")
    hdd_size_gb: int = pydantic.Field(description="Size of the laptop's drive or HDD in GB")
    screen_size_in: float = pydantic.Field(description="Size of the laptop's screen in inches")
    price_dollars: float
Next, we need to define a prompt template:
You are an expert on extracting structured information from text content. Here is a laptop listing: ----- Listing start ----- {listing} ----- Listing end ----- Extract information about this laptop in a structured form. Try to derive information indirectly if not present. For example, screen size might be a part of the laptop model name. Drive/HDD size might be in TB. Be careful about confusing component manufacturers with the laptop manufacturer. Provide an answer in the following JSON schema (skip fields that you can't fill): {schema}:
Now let’s process a real listing taken from Craigslist:
Dell latitude 7480 - $120 (milpitas) Laptop model 7480 CPU I7, 16gb ram DDR4, and 256 SSD M2 and charger too. It's working and good condition. I sell it because my daughter does not need it. Please email me.
We can simply run the following piece of code:
model = llama_cpp.Llama.from_pretrained(repo_id=MODEL_REPO, filename=MODEL_FILE, n_ctx=4096, n_gpu_layers=-1)
guidance_model = guidance.models.LlamaCpp(model, echo=False)
# `prompt` is the template above, `listing` is the Craigslist listing text
prompt_for_listing = prompt.format(listing=listing, schema=Laptop.schema())
with guidance.user():
    lm = guidance_model + prompt_for_listing
with guidance.assistant():
    lm += guidance.json(schema=Laptop.schema(), name="answer", temperature=0.0)
print(Laptop(**json.loads(lm["answer"])))
And the output looks like this:
cpu_manufacturer='intel' cpu_model='I7' ram_size_gb=16 hdd_size_gb=256 screen_size_in=0.0 price_dollars=120.0
We were able to correctly extract all the information present in the listing. Yet the output is not perfect: instead of returning a screen size of 0 when this information is not present, it might be better to skip the attribute altogether. Maybe tweaking the prompt a bit could help us fix this issue. We already had to perform such tweaking to get more reliable results for cpu_manufacturer and hdd_size_gb: before our modifications, the model sometimes confused the laptop manufacturer with the CPU manufacturer and had trouble extracting the drive size when it was stated in terabytes.
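Another option worth trying (a sketch we did not benchmark) is to make the fields optional in the schema, so that the schema at least allows the model to leave out values that are not in the listing; whether the constrained generator then actually omits them still depends on the library and the prompt:

class LaptopOptional(pydantic.BaseModel):
    cpu_manufacturer: typing.Optional[typing.Literal["apple", "intel", "amd"]] = pydantic.Field(
        None, description="Manufacturer of the CPU (do not confuse with the manufacturer of the laptop)")
    cpu_model: typing.Optional[str] = None
    ram_size_gb: typing.Optional[int] = None
    hdd_size_gb: typing.Optional[int] = None
    screen_size_in: typing.Optional[float] = pydantic.Field(
        None, description="Size of the laptop's screen in inches; leave out if not stated")
    price_dollars: typing.Optional[float] = None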
If you are interested in more details, you can find the full example on Kaggle.
Conclusion
We have shown that producing JSON is not exclusively the domain of commercial cloud-based LLMs such as GPT. Even small models such as Mistral can handle such tasks when combined with a proper third-party library. However, not all models and not all libraries are equally good. We have seen Falcon, and some combinations of model and library, fail more often than not. Choice matters.