---
title: Import data from Parquet files
description: Leverage Parquet files in Memgraph operations. Our detailed guide simplifies the process for an enhanced graph computing journey.
---

import { Callout } from 'nextra/components'
import { Steps } from 'nextra/components'
import { Tabs } from 'nextra/components'
import {CommunityLinks} from '/components/social-card/CommunityLinks'

# Import data from Parquet files

Data from Parquet files can be imported using the [`LOAD PARQUET` Cypher
clause](#load-parquet-cypher-clause), either from the local disk or from
S3-compatible storage.
## `LOAD PARQUET` Cypher clause

The `LOAD PARQUET` clause uses a background thread that reads the Parquet file
in column batches, assembles them into row batches of 64K rows, and places those
batches into a queue. The main thread then pulls each batch from the queue and
processes it row by row. For every row, it binds the parsed values to the
specified variables and either populates the database (if it is empty) or
appends the new rows to an existing dataset.

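The batching pipeline described above follows a classic producer–consumer pattern. It can be sketched in Python as a simplified illustration of the idea (not Memgraph's actual implementation; file reading and row processing are stubbed out):

```python
import queue
import threading

BATCH_SIZE = 65536  # 64K rows per batch, as in LOAD PARQUET

def producer(rows, out_queue):
    """Background thread: group rows into fixed-size batches and enqueue them."""
    for start in range(0, len(rows), BATCH_SIZE):
        out_queue.put(rows[start:start + BATCH_SIZE])
    out_queue.put(None)  # sentinel: no more batches

def consume(rows):
    """Main thread: pull batches off the queue and process them row by row."""
    batches = queue.Queue(maxsize=4)  # bounded queue limits memory use
    threading.Thread(target=producer, args=(rows, batches)).start()
    processed = 0
    while (batch := batches.get()) is not None:
        for _row in batch:
            processed += 1  # here Memgraph would bind values and create data
    return processed
```

The bounded queue lets the reader stay ahead of the main thread without loading the whole file into memory at once.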
### `LOAD PARQUET` clause syntax

The syntax of the `LOAD PARQUET` clause is:

```cypher
LOAD PARQUET FROM <parquet-location> ( WITH CONFIG configs=configMap )? AS <variable-name>
```

- `<parquet-location>` is a string that specifies where the Parquet file is
  located.<br/>
  If the path **does not** start with `s3://`, it is treated as a local file
  path. If it **does** start with `s3://`, Memgraph retrieves the file from the
  S3-compatible storage using the provided URI. There are no restrictions on the
  file’s location within your local file system, as long as the path is valid
  and the file exists. If you are using Docker to run Memgraph, you will need to
  [copy the files from your local directory into the
  Docker](/getting-started/first-steps-with-docker#copy-files-from-and-to-a-docker-container)
  container where Memgraph can access them. <br/>

- `<configs>` represents an optional configuration map through which you can
  specify the following options:
  - `aws_region`: The region in which your S3 service is located.
  - `aws_access_key`: The access key used to connect to the S3 service.
  - `aws_secret_key`: The secret key used to connect to the S3 service.
  - `aws_endpoint_url`: Optional. Can be used to set the URL of the
    S3-compatible storage.
- `<variable-name>` is a symbolic name for the variable to which the contents
  of each parsed row will be bound, enabling access to the row contents later
  in the query. The variable doesn't have to be used in any subsequent clause.

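Putting the pieces together, a query reading from an S3 bucket with inline credentials might look like the sketch below (the bucket path and credential values are placeholders, not real endpoints):

```cypher
LOAD PARQUET FROM "s3://my-bucket/people.parquet"
WITH CONFIG configs={aws_region: "eu-west-1", aws_access_key: "<access-key>", aws_secret_key: "<secret-key>"}
AS row
CREATE (p:Person) SET p += row;
```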
### `LOAD PARQUET` clause specificities

When using the `LOAD PARQUET` clause, please keep in mind:

- **Type handling:** <br/>
  The parser reads each value using its native Parquet type, so you should
  receive the same data type inside Memgraph. The following types are supported:
  **BOOL, INT8, INT16, INT32, INT64, UINT8, UINT16, UINT32, UINT64, HALF_FLOAT,
  FLOAT, DOUBLE, STRING, LARGE_STRING, STRING_VIEW, DATE32, DATE64, TIME32,
  TIME64, TIMESTAMP, DURATION, DECIMAL128, DECIMAL256, BINARY, LARGE_BINARY,
  FIXED_SIZE_BINARY, LIST, MAP.** <br/>
  Any unsupported types are automatically stored as strings. Note that
  `UINT64` values are cast to `INT64` because Memgraph does not support
  unsigned 64-bit integers, and the Cypher standard only defines 64-bit signed
  integers.

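As an illustration of why that cast matters: an unsigned 64-bit value above 2^63 - 1 has no signed equivalent of the same width, so reinterpreting its bits as a signed integer wraps it negative. The Python sketch below demonstrates the bit reinterpretation only; Memgraph's exact handling of out-of-range values is not specified here.

```python
import struct

def uint64_to_int64(value: int) -> int:
    """Reinterpret the bits of a 64-bit unsigned integer as a signed one."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]

# Values up to 2**63 - 1 fit in a signed 64-bit integer unchanged.
print(uint64_to_int64(42))         # 42
# Larger values wrap around to negative when reinterpreted.
print(uint64_to_int64(2**64 - 1))  # -1
```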
- **Authentication parameters:** <br/>
  Parameters for accessing S3-compatible storage (`aws_region`,
  `aws_access_key`, `aws_secret_key`, and `aws_endpoint_url`) can be provided in
  three ways:

  1. Directly in the `LOAD PARQUET` query using the `WITH CONFIG` clause.
  2. Through environment variables: `AWS_REGION`, `AWS_ACCESS_KEY`,
     `AWS_SECRET_KEY`, and `AWS_ENDPOINT_URL`.
  3. Through run-time database settings, using `SET DATABASE SETTING <key> TO
     <value>;`. The corresponding setting keys are: `aws.access_key`,
     `aws.region`, `aws.secret_key`, and `aws.endpoint_url`.

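For instance, the run-time settings can be configured before running the import (the values below are placeholders):

```cypher
SET DATABASE SETTING "aws.region" TO "eu-west-1";
SET DATABASE SETTING "aws.access_key" TO "<access-key>";
SET DATABASE SETTING "aws.secret_key" TO "<secret-key>";
```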
- **The `LOAD PARQUET` clause is not a standalone clause**, meaning a valid query
  must contain at least one more clause, for example:

  ```cypher
  LOAD PARQUET FROM "/people.parquet" AS row
  CREATE (p:Person) SET p += row;
  ```

  In this regard, the following query will throw an exception:

  ```cypher
  LOAD PARQUET FROM "/file.parquet" AS row;
  ```

  **Adding a `MATCH` or `MERGE` clause before `LOAD PARQUET`** allows you to
  match certain entities in the graph before running `LOAD PARQUET`, optimizing
  the process as matched entities do not need to be searched for every row in
  the Parquet file.

  However, a `MATCH` or `MERGE` clause can precede the `LOAD PARQUET` clause
  only if it returns a single row. Returning multiple rows before calling the
  `LOAD PARQUET` clause will cause a Memgraph runtime error.

- **The `LOAD PARQUET` clause can be used at most once per query**, so queries
  like the one below will throw an exception:

  ```cypher
  LOAD PARQUET FROM "/x.parquet" AS x
  LOAD PARQUET FROM "/y.parquet" AS y
  CREATE (n:A {p1 : x, p2 : y});
  ```

### Increase import speed

You can significantly increase data-import speed when using the `LOAD PARQUET`
clause by taking advantage of indexing, batching, and analytical storage mode.

#### 1. Create indexes

`LOAD PARQUET` can establish relationships much faster if
[indexes](/fundamentals/indexes) on nodes or node properties are created *after*
loading the associated nodes:

```cypher
CREATE INDEX ON :Node(id);
```

If `LOAD PARQUET` is **merging** existing data rather than creating new records,
then create the indexes **before** running the import.

#### 2. Use periodic commit

The `USING PERIODIC COMMIT <BATCH_SIZE>` construct optimizes memory allocation
and can improve import speed by **25–35%**, based on our benchmarks.

```cypher
USING PERIODIC COMMIT 1024
LOAD PARQUET FROM "/x.parquet" AS x
CREATE (n:A {p1: x.p1, p2: x.p2});
```

#### 3. Switch to analytical storage mode

Import performance can also improve by switching Memgraph to [analytical storage
mode](/fundamentals/storage-memory-usage#storage-modes), which relaxes ACID
guarantees except for manually created snapshots. Once the import is complete,
you can switch back to transactional mode to restore full ACID guarantees.

Switch storage modes within a session:

```cypher
STORAGE MODE IN_MEMORY_{TRANSACTIONAL|ANALYTICAL};
```

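A typical import session then looks like the following sketch: switch to analytical mode, run the imports, and switch back (the file path is a placeholder):

```cypher
STORAGE MODE IN_MEMORY_ANALYTICAL;
LOAD PARQUET FROM "/people.parquet" AS row
CREATE (p:Person) SET p += row;
STORAGE MODE IN_MEMORY_TRANSACTIONAL;
```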
#### 4. Run imports in parallel

When using `IN_MEMORY_ANALYTICAL` mode and storing nodes and relationships in
separate Parquet files, you can run multiple concurrent `LOAD PARQUET` queries
to accelerate the import even further.

For best performance:

1. Split node and relationship data into smaller files.
2. Run all `LOAD PARQUET` statements that **create nodes** first.
3. Then run all `LOAD PARQUET` statements that **create relationships**.
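The ordering above can be sketched with Python's `concurrent.futures`. This is an illustration only: `run_query` is a hypothetical helper that would send a query to Memgraph through a client driver, and the file paths and column names (`first`, `second`) are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical file paths: nodes and relationships split into smaller files.
NODE_FILES = ["/people_1.parquet", "/people_2.parquet"]
REL_FILES = ["/friends_1.parquet", "/friends_2.parquet"]

def run_query(query: str) -> str:
    """Placeholder: a real helper would execute the query via a Memgraph driver."""
    return query

def parallel_import(paths, template):
    """Run one LOAD PARQUET query per file concurrently and collect the results."""
    queries = [template.format(path=p) for p in paths]
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(run_query, queries))

# Step 2: all node-creating imports first...
nodes_done = parallel_import(
    NODE_FILES,
    'LOAD PARQUET FROM "{path}" AS row CREATE (n:Person) SET n += row;')
# Step 3: ...then all relationship-creating imports, so every MATCH finds its endpoints.
rels_done = parallel_import(
    REL_FILES,
    'LOAD PARQUET FROM "{path}" AS row '
    'MATCH (a:Person {{id: row.first}}), (b:Person {{id: row.second}}) '
    'CREATE (a)-[:IS_FRIENDS_WITH]->(b);')
```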

### Usage example

In this example, we will import multiple Parquet files with distinct graph
objects. The data is split across four files; each file contains nodes of a
single label or relationships of a single type.

<Steps>

  {<h3 className="custom-header">Parquet files</h3>}

- [`people_nodes.parquet`](s3://download.memgraph.com/asset/docs/people_nodes.parquet) is used to create nodes labeled `:Person`.<br/> The file contains the following data:
  ```parquet
  id,name,age,city
  100,Daniel,30,London
  101,Alex,15,Paris
  102,Sarah,17,London
  103,Mia,25,Zagreb
  104,Lucy,21,Paris
  ```
- [`restaurants_nodes.parquet`](s3://download.memgraph.com/asset/docs/restaurants_nodes.parquet) is used to create nodes labeled `:Restaurant`.<br/> The file contains the following data:
  ```parquet
  id,name,menu
  200,Mc Donalds,Fries;BigMac;McChicken;Apple Pie
  201,KFC,Fried Chicken;Fries;Chicken Bucket
  202,Subway,Ham Sandwich;Turkey Sandwich;Foot-long
  203,Dominos,Pepperoni Pizza;Double Dish Pizza;Cheese filled Crust
  ```
- [`people_relationships.parquet`](s3://download.memgraph.com/asset/docs/people_relationships.parquet) is used to connect people with the `:IS_FRIENDS_WITH` relationship.<br/> The file contains the following data:
  ```parquet
  first_person,second_person,met_in
  100,102,2014
  103,101,2021
  102,103,2005
  101,104,2005
  104,100,2018
  101,102,2017
  100,103,2001
  ```
- [`restaurants_relationships.parquet`](s3://download.memgraph.com/asset/docs/restaurants_relationships.parquet) is used to connect people with restaurants using the `:ATE_AT` relationship.<br/> The file contains the following data:
  ```parquet
  PERSON_ID,REST_ID,liked
  100,200,true
  103,201,false
  104,200,true
  101,202,false
  101,203,false
  101,200,true
  102,201,true
  ```

  {<h3 className="custom-header">Import nodes</h3>}

  Each row will be parsed as a map, and the fields can be accessed using the
  property lookup syntax (e.g. `id: row.id`). Files can be imported directly
  from S3 or downloaded first and then accessed from the local disk.

  The following query will load the file row by row and create a new node for
  each row, with properties based on the parsed row values:

  ```cypher
  LOAD PARQUET FROM "s3://download.memgraph.com/asset/docs/people_nodes.parquet" AS row
  CREATE (n:Person {id: row.id, name: row.name, age: row.age, city: row.city});
  ```

  In the same manner, the following query will create a new node for each restaurant:

  ```cypher
  LOAD PARQUET FROM "s3://download.memgraph.com/asset/docs/restaurants_nodes.parquet" AS row
  CREATE (n:Restaurant {id: row.id, name: row.name, menu: row.menu});
  ```

  {<h3 className="custom-header">Create indexes</h3>}

  Creating an [index](/fundamentals/indexes) on a property used to connect nodes
  with relationships, in this case the `id` property of the `:Person` nodes,
  will speed up the import of relationships, especially with large datasets:

  ```cypher
  CREATE INDEX ON :Person(id);
  ```

  {<h3 className="custom-header">Import relationships</h3>}

  The following query will create relationships between the people nodes:

  ```cypher
  LOAD PARQUET FROM "s3://download.memgraph.com/asset/docs/people_relationships.parquet" AS row
  MATCH (p1:Person {id: row.first_person})
  MATCH (p2:Person {id: row.second_person})
  CREATE (p1)-[f:IS_FRIENDS_WITH]->(p2)
  SET f.met_in = row.met_in;
  ```

  The following query will create relationships between people and the restaurants where they ate:

  ```cypher
  LOAD PARQUET FROM "s3://download.memgraph.com/asset/docs/restaurants_relationships.parquet" AS row
  MATCH (p1:Person {id: row.PERSON_ID})
  MATCH (re:Restaurant {id: row.REST_ID})
  CREATE (p1)-[ate:ATE_AT]->(re)
  SET ate.liked = toBoolean(row.liked);
  ```

  {<h3 className="custom-header">Final result</h3>}

  Run the following query to see how the imported data looks as a graph:

  ```cypher
  MATCH p=()-[]-() RETURN p;
  ```

  

</Steps>

<CommunityLinks/>