This post is about the second release of Kùzu. However, we want to start with something much more important:
Donate to the Victims of Türkiye-Syria Earthquake:
Our hearts, thoughts, and prayers go to all the victims, those who survived and those who passed, in Syria and Türkiye. It will be a very difficult winter for all those who survived, so everyone needs to help. Here are two pointers to trustworthy organizations we know of that are trying to help victims on the ground. For Türkiye (where Semih is from), you can donate to Ahbap (please be aware that donations are in TL; 14 TL = 1 CAD and 19 TL = 1 USD); and for Syria you can donate to the White Helmets. Be generous! We list several other organizations in the footnote at the end of this post.
Overview of Kùzu 0.0.2
Back to our release. The Kùzu codebase is changing fast, but this release still has a focus: since the last release we have worked hard on importing data from different formats into Kùzu and exporting data from Kùzu to different formats. There are also several other important additions: new Cypher clauses and queries, additional string processing capabilities, and new DDL statements. We summarize each of these below.
- General Kùzu Demo
- Export Query Results to PyTorch Geometric: Node Property Prediction Example
- Export Query Results to PyTorch Geometric: Link Prediction Example
- Export Query Results to NetworkX
Exporting Query Results to PyTorch Geometric and NetworkX
Perhaps most excitingly, we have added the first capabilities to integrate with two popular graph data science libraries: (i) PyTorch Geometric (PyG) for performing graph machine learning; and (ii) NetworkX for a variety of graph analytics, including visualization.
Our Python API now has a QueryResult.get_as_torch_geometric() function that converts results of queries to PyG's in-memory graph representation. If your query results contain node and relationship objects, the function uses those nodes and relationships to construct either torch_geometric.data.Data or torch_geometric.data.HeteroData objects. The function also auto-converts any numeric or boolean property on the nodes into tensors that can be used as node features in these objects. Any property that cannot be auto-converted, as well as the edge properties, is also returned in case you want to put them into these objects manually.
Colab Demonstrations: The two PyG Colab notebooks listed at the top of this post show how you can develop graph learning pipelines using Kùzu as your GDBMS.
The examples demonstrate how to extract a subgraph, train graph convolutional or neural networks (GCNs or GNNs), make some node property or link predictions and save them back in Kùzu so you can query these predictions.
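As a rough sketch of the export step, the helper below runs a Cypher query and hands the result to get_as_torch_geometric(). It only assumes a connection object whose execute() method returns a QueryResult. The helper name, the example query, and the User and Follows labels are hypothetical. The exact shape of the returned value may differ between Kùzu versions, so the sketch passes it through unchanged.

```python
def query_to_pyg(conn, cypher):
    """Run `cypher` on a Kùzu connection and export the result to PyG.

    `conn` is assumed to be an open Kùzu connection whose execute() returns
    a QueryResult. The query should return node and relationship variables
    so that a graph can be built from them.
    """
    result = conn.execute(cypher)
    # Numeric and boolean node properties are converted to tensors by the
    # library; whatever else it returns is passed through untouched here.
    return result.get_as_torch_geometric()


# Hypothetical query: the labels User and Follows are placeholders.
EXAMPLE_QUERY = "MATCH (a:User)-[e:Follows]->(b:User) RETURN a, e, b"
```

Keeping the connection as a parameter means the same helper can be reused for every subgraph you want to train on.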
Our Python API now has a QueryResult.get_as_networkx() function that can convert query results that contain nodes and relationships into NetworkX directed or undirected graphs. Using this function, you can build pipelines that benefit from both Kùzu's DBMS functionality (e.g., querying, data extraction, and transformations using a high-level query language with very fast performance) and NetworkX's rich library of graph analytics algorithms.
Colab Demonstration: The NetworkX Colab notebook listed at the top of this post shows how to do basic graph visualization of query results and how to build a pipeline that computes PageRank values for a subgraph, stores them back in Kùzu as new node properties, and queries them.
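The analytics half of such a pipeline could look roughly like the sketch below. It assumes a connection object whose execute() returns a QueryResult, and it uses NetworkX's pagerank function by default. The helper name and the rank_fn parameter are hypothetical, and writing the scores back to Kùzu is left out.

```python
def rank_query_result(conn, cypher, rank_fn=None):
    """Export a query result to NetworkX and rank its nodes.

    `rank_fn` takes a graph and returns a dict of node -> score. It defaults
    to networkx.pagerank; passing it explicitly makes it easy to swap in
    another centrality measure.
    """
    graph = conn.execute(cypher).get_as_networkx()
    if rank_fn is None:
        import networkx as nx  # imported lazily, only needed for the default
        rank_fn = nx.pagerank
    scores = rank_fn(graph)
    # Highest-scoring nodes first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The sorted list of (node, score) pairs is a convenient starting point for storing the scores as a new node property.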
Data Import from and Export to Parquet and Arrow
We have removed our own CSV reader and now use Arrow as our default library when bulk importing data through COPY FROM statements. Using Arrow, we can bulk import not only from CSV files but also from Arrow IPC and Parquet files. We detect the file type from the suffix of the file, so if the query says COPY user FROM ./user.parquet, we infer that this is a Parquet file and parse it accordingly. See the documentation for details.
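To illustrate the idea of suffix-based detection (this is not Kùzu's implementation), here is a minimal Python sketch. The suffix table is an assumption, since this post does not list the exact suffixes Kùzu recognizes, and the function name is hypothetical.

```python
from pathlib import Path

# Hypothetical suffix table, for illustration only.
SUFFIX_TO_FORMAT = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".arrow": "arrow_ipc",
}


def detect_file_format(path):
    """Return the format name inferred from the file suffix of `path`."""
    suffix = Path(path).suffix.lower()
    try:
        return SUFFIX_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"unsupported file suffix: {suffix!r}") from None


print(detect_file_format("./user.parquet"))  # parquet
```

Failing loudly on an unknown suffix is one possible design choice; a real system might instead fall back to a default format.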
Multi-labeled or Unlabeled Queries
A very useful feature of the query languages of GDBMSs is their ability to elegantly express unions of join queries. We wrote about this feature in our blog post What Every Competent GDBMS Should Do (see the last paragraph of the section Feature 4: Schema Querying). In Cypher, a good example of this is leaving node and relationship variables unbound to specific node/relationship labels/tables. Consider this query:
MATCH (a:User)-[e]->(b)
WHERE a.name = 'Karissa'
RETURN a, e, b
This query asks for all types of relationships that Karissa can have to any other node (not necessarily of label User). So if the database contains several types of relationships out of User nodes, such as Likes and LivesIn, the variables e and b can bind to records from all of these relationship labels and their destination node labels, respectively.
You can also restrict the labels of nodes/rels to a fixed set that contains more than one label using the | syntax. For example, you can do:
MATCH (a:User)-[e:Likes|Follows]->(b)
WHERE a.name = 'Karissa'
RETURN a, e, b
This forces e to match only Likes or Follows relationship records (so it excludes the LivesIn records we mentioned above). The | syntax is originally adapted from regexes and is also used in query languages that support regular path queries.
Kùzu now supports such queries. Our query execution is based on scanning each possible node/rel table and index. When a variable x can bind to multiple node/rel tables L1, L2, ..., Lk, we reserve one vector for each possible property of each node/rel table. If anyone has ideas for a smarter approach, we would be very interested to hear them!
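The "one vector per property per table" idea can be pictured with the toy Python sketch below. It is a conceptual illustration under simplifying assumptions, not Kùzu's execution engine: tables are plain lists of dicts, vectors are Python lists, and the function name is hypothetical.

```python
def scan_multi_label(tables, labels):
    """Union-scan the tables named in `labels` into one set of vectors.

    tables: dict mapping label -> list of row dicts (property -> value).
    Returns (label_column, vectors), where `vectors` holds one list per
    (label, property) pair, padded with None for rows of other labels.
    """
    # Reserve one vector for each property of each candidate table.
    vectors = {}
    for label in labels:
        for row in tables[label]:
            for prop in row:
                vectors.setdefault((label, prop), [])

    label_column = []
    for label in labels:
        for row in tables[label]:
            label_column.append(label)
            for (vec_label, prop), vec in vectors.items():
                # Only the vectors of the row's own table get a real value.
                vec.append(row.get(prop) if vec_label == label else None)
    return label_column, vectors
```

The padding makes the cost visible: every extra table adds vectors that stay empty for rows of all other tables, which is the overhead a smarter scheme would try to avoid.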
Other Important Changes
Enhanced String Features
We've added two important features to enhance Kùzu's ability to store and process strings:
1) Support for UTF-8 characters. With the help of utf8proc, you can now store string node/relationship properties in Kùzu that contain UTF-8 characters;
2) Support for regex pattern matching on strings. Kùzu now supports Cypher's =~ operator for regex searches, which returns true if its pattern matches the entire input string. For example:
RETURN 'abc' =~ '.*(b|d).*';
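The "entire input string" semantics are easy to get wrong, so here is a small Python analogy using re.fullmatch. Python's regex flavor may differ from the engine Kùzu uses, so treat this only as an illustration of whole-string versus substring matching; the function name is hypothetical.

```python
import re


def regex_matches(text, pattern):
    """Return True only if `pattern` matches all of `text`, not just a part."""
    return re.fullmatch(pattern, text) is not None


print(regex_matches("abc", ".*(b|d).*"))  # True: the pattern covers the whole string
print(regex_matches("abc", "b"))          # False: 'b' matches only a substring
```

This is why the example pattern wraps (b|d) in leading and trailing .* to express "contains b or d".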
We added ALTER TABLE and DROP TABLE DDL statements. After creating a new node or relationship table, you can now drop it, rename it, and alter it by adding new columns/properties, renaming or dropping existing columns/properties.
Disable Relationships with Multiple Source or Destination Labels
We no longer support defining a relationship between multiple source or destination labels. This simplifies our storage. Please let us know if you have strong use cases for this.
Enjoy our new release and don't forget to donate to the earthquake victims.
- For Türkiye, two other organizations are AFAD, which is the public institute for coordinating natural disaster response, and Akut, a volunteer-based and highly organized search and rescue group. For Syria, another campaign I can recommend is Molham Team, which is an organization founded by Syrian refugee students.