· 6 min read

We are very happy to announce the release of Kùzu 0.2.0! This is a major release with two new features: (i) RDFGraphs; and (ii) the Kùzu extensions framework, together with our first extension for accessing files over HTTP(S) servers and on S3. We have also made a set of improvements at the core that make Kùzu faster behind the scenes, along with several other enhancements discussed below.

For details on all the changes in this release, please see the release change log.

RDFGraphs

Kùzu's native data model is a version of the property graph model, where you model your records as a set of entities/nodes and relationships, with properties on both. Kùzu's version of property graphs is, in fact, a structured property graph model, as Kùzu requires you to pre-specify the properties on your nodes and relationships. This is very close to the relational model; the primary difference is that you specify some of your tables as node tables and others as relationship tables.

The second popular graph-based data model in practice is the Resource Description Framework (RDF). RDF is in fact more than a data model: it is part of a larger set of standards by the World Wide Web Consortium (W3C), such as RDF Schema and OWL, that form a well-founded, well-standardized knowledge representation system. In contrast to the property graph model, RDF is particularly suitable for more flexible and heterogeneous information representation. All information, including the actual data as well as the schema of your data, i.e., metadata, is represented homogeneously in the form of (subject, predicate, object) triples.

Kùzu 0.2.0 introduces native support for RDF through a new extension of its data model called RDFGraphs. RDFGraphs is a lightweight extension to Kùzu's data model that allows ingesting triples natively into Kùzu so that they can be queried using Cypher. It is a lightweight extension because an RDFGraph is simply a wrapper around two node tables and two relationship tables that acts as a new object in Kùzu's data model. For example, you can run CREATE/DROP RDFGraph <rdfgraph-name> to create or drop an RDFGraph, which creates or drops the four underlying tables. You can then query these underlying tables with Cypher. Therefore, RDFGraphs are a specific mapping of your triples into Kùzu's native property graph data model, so that you can benefit from Kùzu's easy, scalable, and fast querying capabilities for basic querying of RDF triples.
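
As a quick sketch (the names of the four underlying tables follow the suffix convention described in our RDFGraphs documentation; treat them as illustrative):

CREATE RDFGraph ExampleRDF;
// Under the hood, this creates four tables that you can query directly with Cypher:
// ExampleRDF_r / ExampleRDF_l node tables for resources / literals, and
// ExampleRDF_rt / ExampleRDF_lt relationship tables for their triples.
MATCH (r:ExampleRDF_r) RETURN r.iri LIMIT 5;
DROP RDFGraph ExampleRDF;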

In Short

You can now use Kùzu to store and query RDF data via Cypher!

This release is an important step in our vision to be the go-to system to model your records as graphs. Here is an example from our documentation of how you can use Kùzu to store and query RDF data. Consider a Turtle file uni.ttl modeling information about university students, faculty, and the cities they live in:

@prefix kz: <http://kuzu.io/rdf-ex#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

kz:Waterloo a kz:City ;
    kz:name "Waterloo" ;
    kz:population 150000 .

kz:Adam a kz:student ;
    kz:livesIn kz:Waterloo ;
    kz:name "Adam" ;
    kz:age 30 .

You can create an RDFGraph named UniKG and import the above Turtle file into UniKG as follows:

CREATE RDFGraph UniKG;

COPY UniKG FROM "${PATH-TO-DIR}/uni.ttl";

You can then query all triples with IRI kz:Waterloo as subject as follows:

WITH "http://kuzu.io/rdf-ex#" AS kz
MATCH (s {iri: kz+"Waterloo"})-[p:UniKG]->(o)
RETURN s.iri, p.iri, o.iri, o.val;

Output:
----------------------------------------------------------------------------------------------------------------------------
| s.iri                          | p.iri                                           | o.iri                      | o.val    |
----------------------------------------------------------------------------------------------------------------------------
| http://kuzu.io/rdf-ex#Waterloo | http://kuzu.io/rdf-ex#name                      |                            | Waterloo |
----------------------------------------------------------------------------------------------------------------------------
| http://kuzu.io/rdf-ex#Waterloo | http://kuzu.io/rdf-ex#population                |                            | 150000   |
----------------------------------------------------------------------------------------------------------------------------
| http://kuzu.io/rdf-ex#Waterloo | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://kuzu.io/rdf-ex#City |          |
----------------------------------------------------------------------------------------------------------------------------
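
Because predicates are just properties on the relationships, you can also filter on them. Here is a sketch over the same data that returns only the population triple:

WITH "http://kuzu.io/rdf-ex#" AS kz
MATCH (s)-[p:UniKG]->(o)
WHERE p.iri = kz + "population"
RETURN s.iri, o.val;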

Learn all about RDFGraphs in our documentation page for RDFGraphs: how to CREATE them, how to import triples into them from Turtle files, the property graph nodes and relationships they map to, and how to query and modify them.

Extensions framework

Kùzu 0.2.0 introduces a new framework for extending Kùzu's capabilities, similar to PostgreSQL's and DuckDB's extensions. Extensions are a way to add new features to Kùzu without modifying the core code. The 0.2.0 release is just the beginning of our development of this framework, and we are happy to ship our first extension, httpfs, which supports reading data from files hosted on an HTTP(S) server; it can also be used to read from Amazon S3. You can use the httpfs extension by installing it and dynamically loading it as follows:

INSTALL httpfs;
LOAD EXTENSION httpfs;

You can then read files hosted remotely on an HTTP(S) server or on Amazon S3 as follows:

LOAD FROM "https://raw.githubusercontent.com/kuzudb/extension/main/dataset/test/city.csv" 
RETURN *;

Output:
Waterloo|150000
Kitchener|200000
Guelph|75000
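
Reading from S3 typically requires credentials to be configured first. Here is a minimal sketch using the s3_* settings from the httpfs documentation (the values are placeholders):

CALL s3_access_key_id='${AWS_ACCESS_KEY_ID}';
CALL s3_secret_access_key='${AWS_SECRET_ACCESS_KEY}';
CALL s3_region='us-east-1';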

The following example shows how to read a file from Amazon S3:

LOAD FROM 's3://kuzu-test/follows.parquet'
RETURN *;

You can also write to S3 using the httpfs extension. Read all about it in our documentation.
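
As a sketch, writing query results to an s3:// path looks like the following (the bucket, path, and City node table here are hypothetical):

COPY (MATCH (c:City) RETURN c.name, c.population) TO 's3://your-bucket/city.parquet';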

Over time, we plan to implement additional extensions, such as ones that support new data types, functions, and indices.

Improvements at the Core

We are also continuing non-stop to make the core of Kùzu faster and more efficient. We have improved our hash index building by parallelizing it (other parts of the copy pipeline were already parallelized) and through several other optimizations, which improves bulk loading performance. Here is a comparison of the bulk loading time (in seconds) of the LDBC Comments table, which consists of ~220M records (~22 GB):

---------------------------------------------------------------------
| Threads | Kùzu 0.1.0 (s) | Kùzu 0.2.0 (s) | Performance improvement |
---------------------------------------------------------------------
| 1       | 536.1          | 496.5          | 7.4%                    |
| 2       | 289.1          | 257.3          | 11.0%                   |
| 4       | 161.7          | 137.5          | 15.0%                   |
| 8       | 116.8          | 77.6           | 33.5%                   |
---------------------------------------------------------------------
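
For context, these are wall-clock times for a bulk load along the following lines (the file path and options are illustrative):

// The number of threads defaults to the number of CPU cores and can be
// adjusted through your connection's configuration, which is how the
// table above varies the thread count.
COPY Comment FROM 'ldbc/comment_0_0.csv' (HEADER=true, DELIM='|');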

We have also improved our disk-based CSR implementation to make it faster when ingesting data through CREATE statements (which are intended for loading small amounts of data), and we added constant compression; both improve Kùzu's performance in some cases in minor ways. The two ingestion paths are sketched below.
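
A hedged sketch of the distinction (the City table is hypothetical): COPY for large bulk loads, CREATE for small incremental inserts that exercise the disk-based CSR:

// Bulk path: parallelized COPY, best for loading large files into an empty table
COPY City FROM 'city.csv';
// Trickle path: small transactional inserts via CREATE
CREATE (:City {name: 'Guelph', population: 75000});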

Closing Remarks

In addition to the above, this release includes the following:

  • Several additional improvements to Kùzu's command line interface
  • A new UUID data type
  • Many improvements to our testing framework

These updates were all made by our amazing interns 😎. As always, we would like to thank everyone in the Kùzu team for making this release possible and look forward to user feedback!

· 14 min read
Prashanth Rao

Ever since the birth of database management systems (DBMSs), tabular relations and graphs have been the core data structures used to model application data in two broad classes of systems: relational DBMSs (RDBMSs) and graph DBMSs (GDBMSs).

In this post, we'll look at how to transform data that might exist in a typical relational system into a graph and load it into a Kùzu database. The aim of this post and the next one is to showcase "graph thinking"1, where you explore connections in your existing structured data to potentially uncover new insights.

Code

The code to reproduce the workflow shown in this post can be found in the graphdb-demo repository. It uses Kùzu's Python API, but you are welcome to use the client API of your choice.

· 20 min read
Semih Salihoğlu

In my previous post, I gave an overview of question answering (Q&A) systems that use LLMs over private enterprise data. I covered the architectures of these systems and the common tools developers use to build them when the enterprise data is structured, i.e., when data exists as records stored in some DBMS, relational or graph. I referred to these systems as RAG systems using structured data. In this post, I cover RAG systems that use unstructured data, such as text files, PDF documents, or internal HTML pages in an enterprise. I will refer to these as RAG-U systems, or sometimes simply as RAG-U (I should have used the term RAG-S in the previous post!).

To remind readers: I decided to write these two posts after doing a lot of reading in the space to understand the role of knowledge graphs (KGs) and graph DBMSs in LLM applications. My goals are (i) to give an overview of the field to readers who want to get started but are intimidated by the area; and (ii) to point to several future work directions that I find important.1


  1. In this post I'm only covering approaches that ultimately retrieve some unstructured data (or a transformation of it) to put into LLM prompts. I am not covering approaches that query a pre-existing KG directly and use its records as additional data in a prompt. See this post by Ben Lorica for an example; the third bullet point after "Knowledge graphs significantly enhance RAG models" describes such an approach. In my organization of RAG approaches, these fall under RAG using structured data, since KG records are structured.

· 26 min read
Semih Salihoğlu

During the holiday season, I did some reading on LLMs and specifically on the techniques that use LLMs together with graph databases and knowledge graphs. If you are new to the area like me, the amount of activity on this topic on social media as well as in research publications may have intimidated you. If so, you're exactly my target audience for this new blog post series I am starting. My goals are two-fold:

  1. Overview the area: I want to present what I learned using a simple and consistent terminology, and at a more technical depth than you might find in other blog posts. I am aiming for a depth similar to what I aim for when preparing a lecture. I will link to many high-quality and technically satisfying pieces of content (mainly papers, since the area is very researchy).
  2. Overview important future work: I want to cover several important directions for future work in the space. I don't just mean research contributions, but also simple approaches to experiment with if you are building question answering (Q&A) applications using LLMs and graph technology.

This post covers the topic of retrieval augmented generation (RAG) using structured data. Then, in a follow-up post, I will cover RAG using unstructured data, where I will also mention a few ways people are building RAG-based Q&A systems that use both structured and unstructured data.