Pandas and PyArrow

Because Lance is built on top of Apache Arrow, LanceDB is tightly integrated with the Python data ecosystem, including Pandas and PyArrow. A typical workflow with Pandas is shown step by step below.

Create dataset

Let’s first import LanceDB:

    import lancedb

Next, we’ll import Pandas:

    import pandas as pd

Sync API

We’ll first connect to LanceDB.

    uri = "data/sample-lancedb"
    db = lancedb.connect(uri)

We can create a LanceDB table directly from a Pandas DataFrame by passing it as the data parameter:

    data = pd.DataFrame(
        {
            "vector": [[3.1, 4.1], [5.9, 26.5]],
            "item": ["foo", "bar"],
            "price": [10.0, 20.0],
        }
    )
    table = db.create_table("pd_table", data=data)

Similar to the pyarrow.dataset.write_dataset() function, LanceDB’s db.create_table() accepts data in a variety of forms, including pyarrow datasets.

Async API

Connect to LanceDB:

    uri = "data/sample-lancedb"
    async_db = await lancedb.connect_async(uri)

We can create a LanceDB table directly from a Pandas DataFrame by passing it as the data parameter:

    data = pd.DataFrame(
        {
            "vector": [[3.1, 4.1], [5.9, 26.5]],
            "item": ["foo", "bar"],
            "price": [10.0, 20.0],
        }
    )
    await async_db.create_table("pd_table_async", data=data)

Larger-than-memory data

If you have a dataset that is larger than memory, you can create a table from an Iterator[pyarrow.RecordBatch], which loads the data lazily, one batch at a time:

    from typing import Iterable

    import pyarrow as pa

    def make_batches() -> Iterable[pa.RecordBatch]:
        for i in range(5):
            yield pa.RecordBatch.from_arrays(
                [
                    pa.array([[3.1, 4.1], [5.9, 26.5]]),
                    pa.array(["foo", "bar"]),
                    pa.array([10.0, 20.0]),
                ],
                ["vector", "item", "price"],
            )

You can then pass the generator returned by calling make_batches() as the data parameter of create_table(), and specify the pyarrow schema with the schema parameter.

Sync API

    schema = pa.schema(
        [
            pa.field("vector", pa.list_(pa.float32())),
            pa.field("item", pa.utf8()),
            pa.field("price", pa.float32()),
        ]
    )
    table = db.create_table("iterable_table", data=make_batches(), schema=schema)

Async API

    schema = pa.schema(
        [
            pa.field("vector", pa.list_(pa.float32())),
            pa.field("item", pa.utf8()),
            pa.field("price", pa.float32()),
        ]
    )
    await async_db.create_table(
        "iterable_table_async", data=make_batches(), schema=schema
    )

You will find detailed instructions on creating a LanceDB dataset in the Getting Started and API sections.

Vector search

We can now perform similarity search via the LanceDB Python API.

Sync API

    # Open the table previously created.
    table = db.open_table("pd_table")

    query_vector = [100, 100]
    # Pandas DataFrame
    df = table.search(query_vector).limit(1).to_pandas()
    print(df)

Async API

    # Open the table previously created.
    async_tbl = await async_db.open_table("pd_table_async")

    query_vector = [100, 100]
    # Pandas DataFrame
    df = await (await async_tbl.search(query_vector)).limit(1).to_pandas()
    print(df)

This returns a Pandas DataFrame as follows:

            vector  item  price    _distance
    0  [5.9, 26.5]   bar   20.0  14257.05957
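The _distance value shown here is consistent with the squared Euclidean (L2) distance between the query vector and the stored vector, which is what you would expect if the table uses an L2 metric. The following plain-Python check reproduces it by hand; squared_l2 is a hypothetical helper, not a LanceDB function.

```python
def squared_l2(a, b):
    # Sum of squared per-dimension differences (no square root taken).
    return sum((x - y) ** 2 for x, y in zip(a, b))


query_vector = [100, 100]
rows = {"foo": [3.1, 4.1], "bar": [5.9, 26.5]}

# Distance from every stored vector to the query vector.
distances = {item: squared_l2(vec, query_vector) for item, vec in rows.items()}

# limit(1) keeps only the closest row, which is "bar" here.
nearest = min(distances, key=distances.get)
print(nearest, round(distances[nearest], 2))  # bar 14257.06
```

The small difference from the 14257.05957 in the DataFrame is what you would expect if the distance is computed in 32-bit floats.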

If you have a simple filter, it’s faster to provide a where clause to LanceDB’s search method. For more complex filters or aggregations, you can always resort to using the underlying DataFrame methods after performing a search.

Sync API

    # Apply the filter via LanceDB
    results = table.search([100, 100]).where("price < 15").to_pandas()
    assert len(results) == 1
    assert results["item"].iloc[0] == "foo"

    # Apply the filter via Pandas
    df = table.search([100, 100]).to_pandas()
    results = df[df.price < 15]
    assert len(results) == 1
    assert results["item"].iloc[0] == "foo"

Async API

    # Apply the filter via LanceDB
    results = await (await async_tbl.search([100, 100])).where("price < 15").to_pandas()
    assert len(results) == 1
    assert results["item"].iloc[0] == "foo"

    # Apply the filter via Pandas
    df = await (await async_tbl.search([100, 100])).to_pandas()
    results = df[df.price < 15]
    assert len(results) == 1
    assert results["item"].iloc[0] == "foo"
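For the more complex filters and aggregations mentioned above, ordinary Pandas operations apply once the results are in a DataFrame. Here is a small sketch using a hand-made stand-in for a search result; the rows and _distance values are invented for illustration and do not come from the tables created earlier.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by
# table.search(...).to_pandas() on a larger table.
results = pd.DataFrame(
    {
        "item": ["foo", "bar", "foo", "baz"],
        "price": [10.0, 20.0, 12.0, 30.0],
        "_distance": [5.0, 1.0, 3.0, 2.0],
    }
)

# Compound filter: cheap items that are also close to the query vector.
cheap_and_close = results[(results["price"] < 15) & (results["_distance"] < 4)]

# Aggregation: closest match and mean price per item.
summary = results.groupby("item").agg(
    best_distance=("_distance", "min"),
    mean_price=("price", "mean"),
)
print(cheap_and_close)
print(summary)
```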