
Let’s make LLMs generate JSON!



When we try to integrate large language models into our apps, we often struggle with the unpredictability of their output. We can’t be sure about what the answer to a request is going to look like, so it is very difficult to write code for processing this answer.

That is, until we somehow force the LLM to produce answers predictably, say as a JSON object that follows a predefined schema. We could try to use some prompt engineering magic, but the output format we specify is still not guaranteed. Isn’t there a way of solving this problem in a more deterministic way? OpenAI has a solution that works reasonably well: function calling. (It also has a JSON mode, but to be honest, I haven’t used it to this day).

But not everyone can or wants to use OpenAI’s GPT models or other solutions that require you to send data to someone else’s computer. If you are one of those people, open-weight local LLMs might be your go-to alternative. How can we solve the issue of output predictability when using one of these models?

In this article, we are going to talk about three tools that can, at least in theory, force any local LLM to produce structured output: LM Format Enforcer, Outlines, and Guidance. After a short description of each tool, we will evaluate their performance on a few test cases ranging from book recommendations to extracting information from HTML. And the best for the end, we will show you how forcing LLMs to produce a structured output can be used to solve a very common problem in many businesses: extracting structured records from free-form text.

Just one last note before we move to the interesting stuff: structured output can take many forms. But in this article, we want to focus specifically on JSON, because it is well-known and all popular programming languages have tools for parsing this format.

Available tools

LM Format Enforcer

LM Format Enforcer works by combining a character-level parser with a tokenizer prefix tree. The character-level parser limits the set of characters that can be added to the output in the next step based on the constraints we set. For example, when generating JSON, the opening bracket { must be followed by whitespace, a closing bracket, or a quotation mark that starts a property name. The tokenizer prefix tree (also called a trie) is built from the tokenizer of the language model. The library combines both data structures to filter the tokens that the language model is allowed to generate in the next step.
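To make this more concrete, here is a toy, pure-Python sketch of the filtering idea (not the library’s actual implementation). Using the rule from the example above, we scan a tiny vocabulary and keep only the tokens that may directly follow an opening brace:

```python
# Toy sketch of character-level filtering (not LM Format Enforcer internals).
# Rule from the example: after '{', the next character must be whitespace,
# a closing bracket, or a quotation mark starting a property name.

def allowed_after_open_brace(token: str) -> bool:
    """Return True if this token may directly follow an opening '{'."""
    return token[0] in {' ', '\n', '\t', '}', '"'}

vocab = ['"title', '}', ' ', 'hello', '42', '\n  "']
allowed = [t for t in vocab if allowed_after_open_brace(t)]
print(allowed)  # ['"title', '}', ' ', '\n  "']
```

A real implementation applies such checks character by character along the trie built from the tokenizer’s vocabulary, so a multi-character token is accepted only if every one of its characters keeps the output valid.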

The library supports two types of generation constraints: JSON schemas and regular expressions. One of the greatest advantages of this library is that it doesn’t introduce unnecessary abstractions. It integrates well with existing types and functions of the Transformers library. It can be used very easily with a Transformers pipeline and Pydantic, for example:

from pydantic import BaseModel
from transformers import pipeline
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn

class Book(BaseModel):
    author: str
    title: str

class BookList(BaseModel):
    books: list[Book]

p = pipeline('text-generation', model=MODEL_NAME, device_map='auto')
prompt = f"Recommend me 3 books on climate change in the following JSON schema: {BookList.schema_json()}:\n"

# Here starts the LM Format Enforcer part
parser = JsonSchemaParser(BookList.schema())
prefix_function = build_transformers_prefix_allowed_tokens_fn(p.tokenizer, parser)
output_dict = p(prompt, prefix_allowed_tokens_fn=prefix_function, max_new_tokens=512)
result = output_dict[0]['generated_text'][len(prompt):]
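Because generation is schema-constrained, the resulting string can go straight to a standard JSON parser. A minimal sketch; the book shown here is only an illustrative stand-in for whatever the model actually returns:

```python
import json

# Illustrative stand-in for the `result` string produced above.
result = '{"books": [{"author": "Elizabeth Kolbert", "title": "The Sixth Extinction"}]}'

data = json.loads(result)  # parses cleanly thanks to the schema constraint
print(data["books"][0]["title"])  # The Sixth Extinction
```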


Outlines

Outlines constrains the LLM output by using a finite-state machine. The mechanism is explained in great detail in their arXiv paper. What’s interesting about this library is that it can integrate not only with open-weight local LLMs but also with OpenAI’s models. It also provides additional options for constraining the output: besides JSON schemas and regular expressions, you can force the model to choose from a predefined list of strings or to generate an output of a specific data type, such as an integer.

Here is how you can use Outlines to generate JSON:

import json

import outlines

model = outlines.models.transformers(MODEL_NAME)
schema = json.dumps(BookList.schema())
generator = outlines.generate.json(model, schema)
result = generator(prompt)

You can see that, compared to LM Format Enforcer, Outlines adds more abstraction. This might make it more difficult to customize the behavior of the model, but at the same time, this higher level of abstraction might be exactly what enables Outlines to work with non-local models or with models built on top of libraries other than Transformers, such as Llama.cpp.


Guidance

Guidance is by far the most popular option of the three listed here, at least if we go by GitHub stars. It also seems to be the most flexible, as it gives you a lot of control over the LLM output. Its API is also unique in its resemblance to string concatenation. This approach has a lot of advantages: you can build the output part by part by combining string literals with constrained or free generation. When it comes to defining constraints, Guidance gives you a plethora of options: regular expressions, data types, selecting from multiple choices, or even calling your own functions. It integrates well with local LLMs, as well as with models from OpenAI or Google’s Vertex AI, including multimodal models with image-processing capabilities.
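To illustrate the string-concatenation style in the abstract (this sketch only mimics the shape of the workflow; Guidance’s real API differs, and no LLM is called here), fixed literals can be interleaved with constrained slots:

```python
import re

# Toy illustration of building output part by part. `fake_gen` is our own
# stand-in for a constrained LLM call: it accepts a candidate completion only
# if it matches the required regex.
def fake_gen(pattern: str, candidate: str) -> str:
    if not re.fullmatch(pattern, candidate):
        raise ValueError(f"{candidate!r} violates {pattern!r}")
    return candidate

out = "The laptop has " + fake_gen(r"\d+", "16") + " GB of RAM."
print(out)  # The laptop has 16 GB of RAM.
```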

There is an important catch though. Generating JSON is quite difficult. In theory, you can do it manually by combining other constraints, but this takes a lot of time. In its most recent version, Guidance added a json() function, but it doesn’t work well at all. In our experiments, it always got stuck in an infinite loop.

Guidance might be a great tool if you want to do something other than producing JSON objects that follow a schema. But since JSON is the main focus of this article, we decided to omit Guidance from further evaluation.

Evaluation: LM Format Enforcer vs. Outlines

We are down to two libraries. It’s time for a side-by-side comparison. To do that, we’ve come up with a small set of prompts that need to be answered by producing a JSON string according to the specified schema. Here they are:

Prompt 1:

Recommend me 3 books on climate change in the following JSON schema: {SCHEMA}:

Prompt 2:

Recommend me 3 books on climate change. Answer:

Prompt 3:

Describe the following HTML:
----- HTML START -----
----- HTML END -----
Here is an answer in the following JSON schema: {SCHEMA}:

Prompt 4:

Here are some of my notes:


Use these notes to generate 3 project ideas in the following JSON format:

Prompt 5:

It is March 15, 2024. Apple's stock price behaved like this yesterday:
- open price: 100.0
- close price: 103.1
- high price: 104.2
- low price: 99.8

Create a price record in the following JSON format: {SCHEMA}:

We wanted to cover many different use cases. At the same time, we wanted to see whether the presence of the schema in the prompt affects the results. That’s why we have Prompt 1 and Prompt 2. One contains the JSON schema, and the other doesn’t.

When it comes to the schemas themselves, we defined them in Pydantic. For example, here are the classes representing the schema for the first two prompts:

class Book(BaseModel):
    author: str
    title: str

class BookList(BaseModel):
    books: list[Book]

To generate the JSON schema, we simply call:

schema = BookList.schema()

Next, we wanted to see how the two libraries work with different models. We decided to evaluate their performance on two popular open-weight LLMs: Mistral and Falcon-7b. So we have 2 models, 2 libraries, and 5 prompts: 20 different configurations in total.

But what are we actually evaluating? There are two parameters we are interested in: output correctness and time. From the correctness standpoint, we need an output that is a valid JSON object and that contains correct information. As for time, we want to know not only a simple number of seconds but also how stable it is. That’s why we ran each of the 20 configurations three times.
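Each configuration was run three times. A minimal harness for this kind of measurement might look like the following sketch; `time_runs` and the dummy workload are our own stand-ins, not the actual benchmark code:

```python
import statistics
import time

# Minimal timing-harness sketch: run a function n times and collect
# per-run durations in seconds.
def time_runs(fn, n=3):
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

# A dummy workload stands in for a generation call here.
runs = time_runs(lambda: sum(range(100_000)))
print(f"{statistics.mean(runs):.4f} s on average over {len(runs)} runs")
```

Reporting the individual durations rather than only the mean is what lets us see how stable the generation time is between runs.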

As for our hardware stack, we decided to keep things simple and run the tests in a Kaggle notebook on two NVIDIA T4s with 16GB of VRAM each. We used Python 3.10 and the most recent versions of each library available on PyPI in March 2024.


So here are the results:


Mistral:

                     Book    Book (no schema)   HTML    Ideas   Stock data
lm-format-enforcer   7.48    n/a                20.47   31.77   5.38
  run 1              8.84    n/a                20.40   31.70   5.41
  run 2              6.80    n/a                20.50   31.80   5.36
  run 3              6.79    n/a                20.50   31.80   5.37
outlines             6.57    6.82               33.40   19.67   n/a
  run 1              7.94    6.17               37.10   17.90   n/a
  run 2              6.82    8.16               33.70   16.40   n/a
  run 3              4.94    6.14               29.40   24.70   n/a


Falcon-7b:

                     Book    Book (no schema)   HTML    Ideas   Stock data
lm-format-enforcer   7.84    n/a                54.53   8.39    6.01
  run 1              9.07    n/a                58.80   8.36    5.99
  run 2              7.28    n/a                52.20   8.35    6.02
  run 3              7.18    n/a                52.60   8.35    6.02
outlines             8.63    4.94               3.10    n/a     n/a
  run 1              8.81    0.42               11.30   7.72    n/a
  run 2              2.98    0.60               n/a     n/a     n/a
  run 3              14.10   11.80              n/a     n/a     n/a

Each of the two tables represents the results for one model. The tables contain the time of each run in seconds, as well as the average time for each model and prompt. The background of each column represents information about correctness. White means that the output was correct, and yellow means that there were small errors. Orange means that the output was parsable, but the values in the JSON object were not correct. Finally, red means that the output wasn’t a valid JSON string.

Right off the bat, we can see several interesting things. First, Mistral seems to be doing much better than Falcon, which has a lot of orange columns: it generated parsable JSON, but the values looked like they were taken from a documentation example. For instance, here is the output for Prompt 4:

{
    'ideas': [
        {'title': 'Idea 1', 'description': 'Idea 1 description'},
        {'title': 'Idea 2', 'description': 'Idea 2 description'},
        {'title': 'Idea 3', 'description': 'Idea 3 description'}
    ]
}

Yes, technically correct, but not exactly what we wanted.

What about the differences between LM Format Enforcer and Outlines? The results favor the former, but neither option was perfect. LM Format Enforcer had trouble with Prompt 2, the only prompt in our dataset that didn’t include the schema description. It looks like this omission has consequences. We received the following string:


Things started well, but quite soon the model decided to stop generating. We observed this issue with both Mistral and Falcon.

The second issue we encountered with LM Format Enforcer was minor in comparison. Have a look at the last prompt. We are saying that the data is for yesterday. So ideally we would like to see March 14 and not March 15 in the generated JSON object. This didn’t happen. But it’s hard to blame LM Format Enforcer for this error. This kind of calculation can be difficult even for larger models.

Let’s move on to Outlines. In general, its output was much more random: the answers differed quite a lot between runs, and the generation time was more variable too. Sometimes we even had to stop the generation process manually because we didn’t see any response after almost 10 minutes. These issues became especially apparent when we combined Outlines with Falcon.

Outlines also hallucinated a lot. While LM Format Enforcer always returned existing books when responding to Prompt 1, Outlines generated plausible-sounding titles that don’t actually exist. The same is true for idea generation. When combined with Mistral, the output was much better than the documentation-like example above, but it wasn’t entirely correct either: the ideas didn’t make much sense. We are not sure what causes these problems. Maybe Outlines sets the generation temperature to a much higher value; in our experiments, we always stuck with the defaults.

The final issue had to do with Prompt 5. Outlines failed to produce a valid JSON object. Unfortunately, its API didn’t allow us to see the raw string, so we couldn’t figure out what went wrong. The prompt differs from all the others in asking the model to produce float and date values instead of just simple strings. Maybe this difference lies at the root of the problem.

What conclusion can we draw here? It seems like the combination of LM Format Enforcer and Mistral performed the best. Its output was the most correct, and time-wise it was quite comparable with other combinations. But we also learned that it might be a good idea to always include the schema description in the prompt.

We are also curious about how the libraries would behave if we used a more powerful model, say Mixtral. Unfortunately, we were limited by the amount of available VRAM.

Extracting structured data from text

We’ve talked about forcing language models to produce JSON according to a provided schema. But what is it good for? There are, of course, many use cases; function calling is one of the most well-known. Here, we want to talk about a different application: extracting structured data from text. This feature can be useful in many situations.

The situation we picked is extracting information from listings on Craigslist. Let’s talk about laptops, for example. We could extract information about the laptop’s RAM, CPU manufacturer, screen size, or drive capacity. This information could then be used to create filters or to analyze the listings. Let’s have a look at how we can turn the text of a listing into JSON with Mistral and LM Format Enforcer.

Let’s start by defining a listing schema with Pydantic:

import typing

import pydantic

class Laptop(pydantic.BaseModel):
    cpu_manufacturer: typing.Literal["apple", "intel", "amd"] = pydantic.Field(
        description="Manufacturer of the CPU (do not confuse with the manufacturer of the laptop)")
    cpu_model: str = pydantic.Field(description="Model of the CPU, for example M3 or i7-8550U")
    ram_size_gb: int = pydantic.Field(description="Size of the laptop's RAM in GB")
    hdd_size_gb: int = pydantic.Field(description="Size of the laptop's drive or HDD in GB")
    screen_size_in: float = pydantic.Field(description="Size of the laptop's screen in inches")
    price_dollars: float

Next, we need to define a prompt template:

You are an expert on extracting structured information from text content.
Here is a laptop listing:

----- Listing start -----
{listing}
----- Listing end -----

Extract information about this laptop in a structured form. Try to derive information indirectly if not present.
For example, screen size might be a part of the laptop model name. Drive/HDD size might be in TB. Be careful about confusing component
manufacturers with the laptop manufacturer.

Provide an answer in the following JSON schema (skip fields that you can't fill): {schema}:

Now let’s process a real listing taken from Craigslist:

Dell latitude 7480 - $120 (milpitas)

Laptop model 7480 CPU I7, 16gb ram DDR4, and 256 SSD M2 and charger too. It's working and good condition. I sell it because my daughter does not need it. Please email me.

We can simply run the following piece of code:

import transformers
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn

pipeline = transformers.pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2", device_map="auto")
prompt_for_listing = prompt.format(listing=listing, schema=Laptop.schema())
parser = JsonSchemaParser(Laptop.schema())
prefix_function = build_transformers_prefix_allowed_tokens_fn(pipeline.tokenizer, parser)
output_dict = pipeline(prompt_for_listing, prefix_allowed_tokens_fn=prefix_function, max_new_tokens=256)
result = output_dict[0]['generated_text'][len(prompt_for_listing):]

And the output looks like this:


We were able to correctly extract all pieces of information. Yet the output is not perfect. Instead of returning a screen size of 0 when this information is not present in the listing, it might be better to skip the attribute altogether. This could be achieved by tweaking the schema and/or the prompt. We already had to perform such tweaking to get more reliable results for cpu_manufacturer and hdd_size_gb: before our modifications, the model sometimes confused the laptop manufacturer with the CPU manufacturer and had trouble extracting the drive size when it was stated in terabytes.
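One possible tweak along these lines (a sketch, not the exact schema from our experiments, with the Field descriptions omitted for brevity): declaring screen_size_in as optional lets the model skip the field instead of emitting a made-up value.

```python
import typing

import pydantic

# Sketch of the tweak: screen_size_in defaults to None, so the field can be
# omitted from the generated JSON when the listing doesn't mention it.
class Laptop(pydantic.BaseModel):
    cpu_manufacturer: typing.Literal["apple", "intel", "amd"]
    cpu_model: str
    ram_size_gb: int
    hdd_size_gb: int
    screen_size_in: typing.Optional[float] = None  # no longer required
    price_dollars: float

laptop = Laptop(cpu_manufacturer="intel", cpu_model="i7", ram_size_gb=16,
                hdd_size_gb=256, price_dollars=120.0)
print(laptop.screen_size_in)  # None
```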

If you are interested in more details, you can find the full example on Kaggle.


Conclusion

We have shown that producing JSON is not exclusively the domain of commercial cloud-based LLMs such as GPT. Even small models such as Mistral can handle such tasks when combined with a proper third-party library. However, not all models and not all libraries are equally good. We have seen both Falcon and the Outlines library fail more often than not. Choice matters.
