Friday, February 24, 2023

What's Postgres got to do with AI?

In the past month at Crunchy Data, we have talked to a steady stream of customers & community folks wanting to know how to augment their data platforms for AI. Fortunately, Postgres is equipped, nearly out of the box, and ready for the task of storing and querying this data. Through the magic of OpenAI’s API we can easily send data for classification and return the values.

Alongside this post, I created a sample code-base and data packet here. If you want to jump straight to the code and data, head straight there. We will reference back to the code and the data a few times throughout this post. This favors simplicity for understanding over comprehensiveness. We’ll follow up this post with comprehensive information later.

Using OpenAI in Postgres

For an AI-powered simple recommendation engine, the typical solution will pre-process the data with the following steps:

  1. Send data to an OpenAI model
  2. Receive a tokened set of values called embeddings
  3. Store embeddings in Postgres on the records
  4. Then, we run SQL queries that measure distance between embeddings.

Here’s a diagram of the data going out, going to OpenAI, and coming back as embeddings.

Untitled

A diagram of how embeddings can be measured in terms of closeness.

Untitled

AI with Postgres by example

In this example, we'll use Postgres + OpenAI as a backbone to solve a common problem: Choosing new food recipes that a picky child would enjoy? If we don't, then we'll have pizza all week and turn in crime fighting ninja turtles.

My child has a diverse, but picky palate. He enjoys lamb curry. He does not enjoy things like tomato soup yet, he loves tomato in paste form on a pizza. In this tutorial, we'll use a combination of AI data plus traditional query filters to try to nail down a new recipe that we'll try this week.

Let's see if we can over-engineer a way to make my picky eater happy!

Queries are powered by pg_vector

Powering our queries is a Postgres extension called pg_vector . It gives Postgres the power to natively store, index, and query the OpenAI embeddings. An OpenAI embedding is a set of tokenized float values that allow us to measure the relatedness of text strings. These embeddings look like this:

Untitled

pg_vector can be downloaded from the Github repo and built locally. We have it ready to go on Crunchy Bridge and you could use our $10 / month plan for a quick test.

Loading Sample Data

For my example, I used the ArmedForcesRecipes.xml. We’ll process the XML and load it into a Postgres database using a parser file. The parser file looks like this:

connection_string = ENV['DATABASE_URL']
DB = Sequel.connect(connection_string)

DB.create_table? :recipes do
  primary_key :id
  String :name
  String :description
  Vector :embedding
end

recipes_xml = Nokogiri::XML(File.read('ArmedForcesRecipes.xml'))

for recipe_xml in recipes_xml.xpath('/*/recipe')
  recipe_id = DB[:recipes].insert(
    name: recipe_xml["description"],
    description: recipe_xml.xpath(".//XML_MEMO1")[0]&.text
  )
end

In the parser.rb file, we build the recipes table in our Postgres database. This table contains a name, description, and embedding columns. Notice that embedding is a vector type.

Table "public.recipes"
   Column    |  Type   | Collation | Nullable |             Default
-------------+---------+-----------+----------+----------------------------------
 id          | integer |           | not null | generated by default as identity
 name        | text    |           |          |
 description | text    |           |          |
 embedding   | vector  |           |          |
Indexes:
    "recipes_pkey" PRIMARY KEY, btree (id)

The description contains the long-form process, ingredients, and notes that are typically found in a recipe book. This is the data that we will send to OpenAI for classification. The contents of which are a bit long to list in this blog post, but you can see one by running SELECT description FROM recipes WHERE name = 'Pizza';

Classifying data with OpenAI

For our OpenAI classification, we use the recipes.description column. This column is a fairly standardized set of text that contains the process of cooking the recipe as well as the ingredients.

Using the classifer.rb file, I can send the description to OpenAI API and receive back a data object in the form of a tokenized set of values (called embedding). A sample embedding looks like this:

The classifier.rb file looks like this:

openai = OpenAI::Client.new(access_token: ENV['OPENAI_API_KEY'])

while recipe = DB[:recipes].where(embedding: nil).exclude(description: nil).first do
  submitted_value = recipe[:description].gsub(/\n/, ' ')

  response = openai.embeddings(
    parameters: {
      model: 'text-embedding-ada-002',
      input: submitted_value
    }
  )

  begin
    embedding_value = response["data"][0]["embedding"].to_s
    DB[:recipes].where(id: recipe[:id]).update(embedding: embedding_value)
  rescue
    puts [$!, response].inspect
  end

  sleep 1.2
end

We run a loop that plucks out a single recipe without an embedding, then format the description to remove new line characters. Then, send the value to the OpenAI API using the openai Ruby gem.

From the response, we pluck out the embedding_value and store it with minimal modification directly into the embedding column.

Running this file will take a little while, as OpenAI has a 60 requests / minute limit. Thus, we wait 1.2 seconds between requests. After running this file, the embedding column on the recipe will have data. If you run the following query, you'll find a set of float values: SELECT embedding FROM recipes WHERE embedding IS NOT NULL LIMIT 1;. The response will look like this:

-[ RECORD 1 ]--------------------------------------------------------------------------------------------
embedding | [0.019761972,-0.0066673,0.0073532923,-0.0010382808,-0.007948044,-0.010725804,-0.0038692644...

Querying for similarities

The crux of finding similar recipes is using the <=> operator, which is supplied by the pg_vector extension. This operator will find the distance between the embedding values returned by OpenAI. So, will be doing something like the following, which will yield a distance as measured by a float.

recipe_1.embedding <=> recipe_2.embedding

To find similar recipes, we can find the recipes with the shortest distance between their embedding values.

Let’s find the closest recipe to pizza. There is a recipe in this dataset with just the term Pizza with id = 431. This is the Pizza recipe that my son loves, so we will use it as our reference recipe.

Let's find the closest recipes to Pizza:

SELECT
        recipe_1.id,
        recipe_1.name,
        recipe_2.id,
        recipe_2.name
FROM (SELECT * FROM recipes WHERE name = 'Pizza') recipe_1,
        recipes AS recipe_2
ORDER BY recipe_1.embedding <=> recipe_2.embedding
LIMIT 4;

result:

 id  | name  | id  |       name
-----+-------+-----+-------------------
 431 | Pizza | 431 | Pizza
 431 | Pizza | 433 | Pizza, 12 in, fzn
 431 | Pizza | 126 | Chicken, parmesan
 431 | Pizza | 435 | Pizza, treats
(4 rows)

With the query above, the closest recipe to pizza is more pizza. Let's use SQL's NOT LIKE conditional statement to filter out other pizza:

SELECT
        recipe_1.id,
        recipe_1.name,
        recipe_2.id,
        recipe_2.name
FROM (SELECT * FROM recipes WHERE name = 'Pizza') recipe_1,
        recipes AS recipe_2
WHERE LOWER(recipe_2.name) NOT LIKE '%pizza%'
ORDER BY recipe_1.embedding <=> recipe_2.embedding
LIMIT 4;

result:

 id  | name  | id  |                name
-----+-------+-----+-------------------------------------
 431 | Pizza | 126 | Chicken, parmesan
 431 | Pizza |  31 | Beef, ground, hamburger, w/parmesan
 431 | Pizza | 211 | Dish, eggplant, parmesan
 431 | Pizza | 229 | Dish, lasagna
(4 rows)

In my scenario, I returned "Chicken, parmesan" as the best option! Excellent! I say "in my scenario" because we are using a few layers of non-deterministic outcomes -- data sent to OpenAI models may return different values, using the <=> operator may return different distances.

Yes, I think my son would love chicken parmesan. When using these recipes, remember to reduce portion sizes, else you'll make food for all of your neighbors and neighbor's neighbors.

Side quest: querying for opposites

What is the opposite of a corn dog? We can flip the order from ASC to DESC and find the maximum distance.

SELECT
        recipe_1.id,
        recipe_1.name,
        recipe_2.id,
        recipe_2.name
FROM (SELECT * FROM recipes WHERE name = 'Corn Dog' LIMIT 1) recipe_1,
        recipes AS recipe_2
WHERE
  recipe_2.description IS NOT NULL
ORDER BY recipe_1.embedding <=> recipe_2.embedding DESC
LIMIT 2;

Result:

 id  |   name   | id  |           name
-----+----------+-----+--------------------------
 165 | Corn Dog | 538 | Salad, green, tossed
 165 | Corn Dog | 519 | Salad, chicken, tropical
(2 rows)

According to the embedding distance, the opposite of Corn Dog is “Salad, green, tossed”. Yeah, that makes sense.

AI in Postgres

The tools we have available today make AI approachable for nearly all scenarios. Over the past 20 years, I can think of many different scenarios where I wish I had access to OpenAI capabilities. pg_vector has the first commits starting in 2021. So many things are coming together in a nice-to-use AI toolbox for the typical application developer. It’s awesome to see how Postgres will power developers to integrate AI capabilities in their products.

Hungry for more? Try pg_vector on Crunchy Bridge today!



from Hacker News https://ift.tt/rjfbwvi

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.