In the past month at Crunchy Data, we have talked to a steady stream of customers & community folks wanting to know how to augment their data platforms for AI. Fortunately, Postgres is equipped, nearly out of the box, and ready for the task of storing and querying this data. Through the magic of OpenAI’s API we can easily send data for classification and return the values.
Alongside this post, I created a sample code-base and data packet here. If you want to jump straight to the code and data, head straight there. We will reference back to the code and the data a few times throughout this post. This favors simplicity for understanding over comprehensiveness. We’ll follow up this post with comprehensive information later.
Using OpenAI in Postgres
For an AI-powered simple recommendation engine, the typical solution will pre-process the data with the following steps:
- Send data to an OpenAI model
- Receive a tokened set of values called embeddings
- Store embeddings in Postgres on the records
- Then, we run SQL queries that measure distance between embeddings.
Here’s a diagram of the data going out, going to OpenAI, and coming back as embeddings.
A diagram of how embeddings can be measured in terms of closeness.
AI with Postgres by example
In this example, we'll use Postgres + OpenAI as a backbone to solve a common problem: Choosing new food recipes that a picky child would enjoy? If we don't, then we'll have pizza all week and turn in crime fighting ninja turtles.
My child has a diverse, but picky palate. He enjoys lamb curry. He does not enjoy things like tomato soup yet, he loves tomato in paste form on a pizza. In this tutorial, we'll use a combination of AI data plus traditional query filters to try to nail down a new recipe that we'll try this week.
Let's see if we can over-engineer a way to make my picky eater happy!
Queries are powered by pg_vector
Powering our queries is a Postgres extension called pg_vector
. It gives Postgres the power to natively store, index, and query the OpenAI embeddings. An OpenAI embedding is a set of tokenized float values that allow us to measure the relatedness of text strings. These embeddings look like this:
pg_vector
can be downloaded from the Github repo and built locally. We have it ready to go on Crunchy Bridge and you could use our $10 / month plan for a quick test.
Loading Sample Data
For my example, I used the ArmedForcesRecipes.xml. We’ll process the XML and load it into a Postgres database using a parser file. The parser file looks like this:
connection_string = ENV['DATABASE_URL']
DB = Sequel.connect(connection_string)
DB.create_table? :recipes do
primary_key :id
String :name
String :description
Vector :embedding
end
recipes_xml = Nokogiri::XML(File.read('ArmedForcesRecipes.xml'))
for recipe_xml in recipes_xml.xpath('/*/recipe')
recipe_id = DB[:recipes].insert(
name: recipe_xml["description"],
description: recipe_xml.xpath(".//XML_MEMO1")[0]&.text
)
end
In the parser.rb
file, we build the recipes
table in our Postgres database. This table contains a name, description, and embedding columns. Notice that embedding
is a vector
type.
Table "public.recipes"
Column | Type | Collation | Nullable | Default
-------------+---------+-----------+----------+----------------------------------
id | integer | | not null | generated by default as identity
name | text | | |
description | text | | |
embedding | vector | | |
Indexes:
"recipes_pkey" PRIMARY KEY, btree (id)
The description
contains the long-form process, ingredients, and notes that are typically found in a recipe book. This is the data that we will send to OpenAI for classification. The contents of which are a bit long to list in this blog post, but you can see one by running SELECT description FROM recipes WHERE name = 'Pizza';
Classifying data with OpenAI
For our OpenAI classification, we use the recipes.description
column. This column is a fairly standardized set of text that contains the process of cooking the recipe as well as the ingredients.
Using the classifer.rb file, I can send the description to OpenAI API and receive back a data
object in the form of a tokenized set of values (called embedding). A sample embedding looks like this:
The classifier.rb
file looks like this:
openai = OpenAI::Client.new(access_token: ENV['OPENAI_API_KEY'])
while recipe = DB[:recipes].where(embedding: nil).exclude(description: nil).first do
submitted_value = recipe[:description].gsub(/\n/, ' ')
response = openai.embeddings(
parameters: {
model: 'text-embedding-ada-002',
input: submitted_value
}
)
begin
embedding_value = response["data"][0]["embedding"].to_s
DB[:recipes].where(id: recipe[:id]).update(embedding: embedding_value)
rescue
puts [$!, response].inspect
end
sleep 1.2
end
We run a loop that plucks out a single recipe without an embedding, then format the description to remove new line characters. Then, send the value to the OpenAI API using the openai
Ruby gem.
From the response, we pluck out the embedding_value
and store it with minimal modification directly into the embedding column.
Running this file will take a little while, as OpenAI has a 60 requests / minute limit. Thus, we wait 1.2 seconds between requests. After running this file, the embedding
column on the recipe will have data. If you run the following query, you'll find a set of float values: SELECT embedding FROM recipes WHERE embedding IS NOT NULL LIMIT 1;
. The response will look like this:
-[ RECORD 1 ]--------------------------------------------------------------------------------------------
embedding | [0.019761972,-0.0066673,0.0073532923,-0.0010382808,-0.007948044,-0.010725804,-0.0038692644...
Querying for similarities
The crux of finding similar recipes is using the <=>
operator, which is supplied by the pg_vector
extension. This operator will find the distance between the embedding values returned by OpenAI. So, will be doing something like the following, which will yield a distance as measured by a float.
recipe_1.embedding <=> recipe_2.embedding
To find similar recipes, we can find the recipes with the shortest distance between their embedding values.
Let’s find the closest recipe to pizza. There is a recipe in this dataset with just the term Pizza
with id = 431
. This is the Pizza recipe that my son loves, so we will use it as our reference recipe.
Let's find the closest recipes to Pizza:
SELECT
recipe_1.id,
recipe_1.name,
recipe_2.id,
recipe_2.name
FROM (SELECT * FROM recipes WHERE name = 'Pizza') recipe_1,
recipes AS recipe_2
ORDER BY recipe_1.embedding <=> recipe_2.embedding
LIMIT 4;
result:
id | name | id | name
-----+-------+-----+-------------------
431 | Pizza | 431 | Pizza
431 | Pizza | 433 | Pizza, 12 in, fzn
431 | Pizza | 126 | Chicken, parmesan
431 | Pizza | 435 | Pizza, treats
(4 rows)
With the query above, the closest recipe to pizza is more pizza. Let's use SQL's NOT LIKE
conditional statement to filter out other pizza:
SELECT
recipe_1.id,
recipe_1.name,
recipe_2.id,
recipe_2.name
FROM (SELECT * FROM recipes WHERE name = 'Pizza') recipe_1,
recipes AS recipe_2
WHERE LOWER(recipe_2.name) NOT LIKE '%pizza%'
ORDER BY recipe_1.embedding <=> recipe_2.embedding
LIMIT 4;
result:
id | name | id | name
-----+-------+-----+-------------------------------------
431 | Pizza | 126 | Chicken, parmesan
431 | Pizza | 31 | Beef, ground, hamburger, w/parmesan
431 | Pizza | 211 | Dish, eggplant, parmesan
431 | Pizza | 229 | Dish, lasagna
(4 rows)
In my scenario, I returned "Chicken, parmesan" as the best option! Excellent! I say "in my scenario" because we are using a few layers of non-deterministic outcomes -- data sent to OpenAI models may return different values, using the <=>
operator may return different distances.
Yes, I think my son would love chicken parmesan. When using these recipes, remember to reduce portion sizes, else you'll make food for all of your neighbors and neighbor's neighbors.
Side quest: querying for opposites
What is the opposite of a corn dog? We can flip the order from ASC
to DESC
and find the maximum distance.
SELECT
recipe_1.id,
recipe_1.name,
recipe_2.id,
recipe_2.name
FROM (SELECT * FROM recipes WHERE name = 'Corn Dog' LIMIT 1) recipe_1,
recipes AS recipe_2
WHERE
recipe_2.description IS NOT NULL
ORDER BY recipe_1.embedding <=> recipe_2.embedding DESC
LIMIT 2;
Result:
id | name | id | name
-----+----------+-----+--------------------------
165 | Corn Dog | 538 | Salad, green, tossed
165 | Corn Dog | 519 | Salad, chicken, tropical
(2 rows)
According to the embedding distance, the opposite of Corn Dog is “Salad, green, tossed”. Yeah, that makes sense.
AI in Postgres
The tools we have available today make AI approachable for nearly all scenarios. Over the past 20 years, I can think of many different scenarios where I wish I had access to OpenAI capabilities. pg_vector
has the first commits starting in 2021. So many things are coming together in a nice-to-use AI toolbox for the typical application developer. It’s awesome to see how Postgres will power developers to integrate AI capabilities in their products.
Hungry for more? Try pg_vector
on Crunchy Bridge today!
from Hacker News https://ift.tt/rjfbwvi
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.