ROAPI Documentation

Project repository

ROAPI automatically spins up read-only APIs for static datasets without requiring you to write a single line of code. It builds on top of Apache Arrow and Datafusion. The core of its design can be boiled down to the following:

  • Query frontends to translate SQL, GraphQL and REST API queries into Datafusion plans.
  • Datafusion for query plan execution.
  • Data layer to load datasets from a variety of sources and formats with automatic schema inference.
  • Response encoding layer to serialize intermediate Arrow record batch into various formats requested by client.

See below for a high level diagram:

roapi-design-diagram

Installation

Pre-built binary

Platform specific pre-built binaries for each release are hosted on our Github release page.

You can download and install them with a single command using pip:

pip install roapi-http

Docker

Pre-built docker images are hosted at ghcr.io/roapi/roapi-http.

Build from source

You need to install Rust toolchain if you haven't done so.

cargo install --locked --git https://github.com/roapi/roapi --branch main --bins roapi-http

Quick start

Spin up APIs for test_data/uk_cities_with_headers.csv and test_data/spacex-launches.json:

roapi-http \
    --table "uk_cities=test_data/uk_cities_with_headers.csv" \
    --table "test_data/spacex_launches.json"

Or using prebuilt docker image:

docker run -t --rm -p 8080:8080 ghcr.io/roapi/roapi-http:latest --addr 0.0.0.0:8080 \
    --table "uk_cities=test_data/uk_cities_with_headers.csv" \
    --table "test_data/spacex_launches.json"

Query API

Query tables using SQL, GraphQL or REST:

curl -X POST -d "SELECT city, lat, lng FROM uk_cities LIMIT 2" localhost:8080/api/sql
curl -X POST -d "query { uk_cities(limit: 2) {city, lat, lng} }" localhost:8080/api/graphql
curl "localhost:8080/api/tables/uk_cities?columns=city,lat,lng&limit=2"

Sample response:

[
  {
    "city": "Elgin, Scotland, the UK",
    "lat": 57.653484,
    "lng": -3.335724
  },
  {
    "city": "Stoke-on-Trent, Staffordshire, the UK",
    "lat": 53.002666,
    "lng": -2.179404
  }
]

See Query frontends for details on different operators supported by each frontend.

Schema API

Get inferred schema for all tables:

curl localhost:8080/api/schema

Sample response:

{
  "uk_cities": {
    "fields": [
      {
        "name": "city",
        "data_type": "Utf8",
        "nullable": false,
        "dict_id": 0,
        "dict_is_ordered": false
      },
      {
        "name": "lat",
        "data_type": "Float64",
        "nullable": false,
        "dict_id": 0,
        "dict_is_ordered": false
      },
      {
        "name": "lng",
        "data_type": "Float64",
        "nullable": false,
        "dict_id": 0,
        "dict_is_ordered": false
      }
    ]
  }
}

Config

Command line argument

You can configure ROAPI to load as many tables as you want by repeating the --table argument:

roapi-http --table 'table1_name=table1_uri' --table 'table2_name=table2_uri'

You can use RUST_LOG environment variable to control the logging verbosity:

RUST_LOG=debug roapi-http ...

Here is the output from roapi-http --help:

roapi-http 0.3.3
QP Hou
Create full-fledged APIs for static datasets without writing a single line of code.

USAGE:
    roapi-http [OPTIONS]

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -a, --addr <IP:PORT>                                           bind address
    -c, --config <config>                                          config file path
    -t, --table <[table_name=]uri[,option_key=option_value]>...
            Table sources to load. Table option can be provided as optional setting as part of the
            table URI, for example: `blogs=s3://bucket/key,format=delta`. Set table uri to `stdin`
            if you want to consume table data from stdin as part of a UNIX pipe. If no table_name is
            provided, a table name will be derived from the filename in URI.

YAML config

You can configure multiple table sources using YAML config, which supports more advanced format specific table options. For example:

addr: 0.0.0.0:8084
tables:
  - name: "blogs"
    uri: "test_data/blogs.parquet"

  - name: "ubuntu_ami"
    uri: "test_data/ubuntu-ami.json"
    option:
      format: "json"
      pointer: "/aaData"
      array_encoded: true
    schema:
      columns:
        - name: "zone"
          data_type: "Utf8"
        - name: "name"
          data_type: "Utf8"
        - name: "version"
          data_type: "Utf8"
        - name: "arch"
          data_type: "Utf8"
        - name: "instance_type"
          data_type: "Utf8"
        - name: "release"
          data_type: "Utf8"
        - name: "ami_id"
          data_type: "Utf8"
        - name: "aki_id"
          data_type: "Utf8"

  - name: "spacex_launches"
    uri: "https://api.spacexdata.com/v4/launches"
    option:
      format: "json"

  - name: "github_jobs"
    uri: "https://jobs.github.com/positions.json"

To run serve tables using config file:

roapi-http -c ./roapi.yml

Dataset formats

JSON

When a table uri ends in .json, ROAPI will try to load it as JSON table if no format option is specified:

tables:
  - name: "mytable"
    uri: "http://mytable.json"

Filter by JSON pointer

Sometimes the JSON array you want to serve might be stored inside a JSON object. To support this use-case, ROAPI supports loading JSON using a JSON pointer.

Take the following JSON data as an example:

{
    "x": {
        "y": [{"col1": "z"}, {"col1": "zz"}]
    }
}

In order to only serve [{"col1": "z"}, {"col1": "zz"}] through ROAPI, you can configure the JSON table source as:

tables:
  - name: "mytable"
    uri: "http://mytable.json"
    option:
      format: "json"
      pointer: "/x/y"

Array encoding

Each row in JSON data can be encoded using array for size reduction. This convention allows us to avoid repeating column names in every row.

For example:

[
    {"col1": 1, "col2": "abc"},
    {"col1": 2, "col2": "efg"}
]

Can be stored as:

[
    [1, "abc"],
    [2, "efg"]
]

However, when loading JSON rows using array encoding, you must explicitly specify the schema, since there is no column name in the datasource anymore for ROAPI to perform the schema inference.

tables:
  - name: "mytable"
    uri: "http://mytable.json"
    option:
      format: "json"
      array_encoded: true
    schema:
      columns:
        - name: "col1"
          data_type: "Int64"
        - name: "col2"
          data_type: "Utf8"

Parquet

When a table uri ends in .parquet, ROAPI will try to load it as Parquet table if no format option is specified:

tables:
  - name: "mytable"
    uri: "http://mytable.parquet"

You can partition a Parquet dataset into multiple partitions and load all of them into a single table by directory path:

tables:
  - name: "mytable"
    uri: "./table_dir"
    option:
      format: "parquet"

Large Datasets

ROAPI loads the entire table into memory as the default behavior. If your table is large or you want to avoid loading all data during startup, you can set an additional option use_memory_table: false (default: true). With that configuration, ROAPI will not copy the data into memory, but instructs datafusion to directly operate on the backing storage.

At the moment, this comes with the following limitations:

  1. no nested schema: datafusion#83
  2. missing support for cloud storage: datafusion#616

Example:

tables:
  - name: "mytable"
    uri: "./table_dir"
    option:
      format: "parquet"
      use_memory_table: false

Note that when providing use_memory_table option, it becomes necessary to also specify the format.

CSV

When a table uri ends in .csv, ROAPI will try to load it as CSV table if no format option is specified:

tables:
  - name: "mytable"
    uri: "http://mytable.csv"

You can partition a CSV dataset into multiple partitions and load all of them into a single table by directory path:

tables:
  - name: "mytable"
    uri: "./table_dir"
    option:
      format: "csv"

Google spreadsheet

To serve a Google spreadsheet as API, you need to gather the following config values:

  • Google spreadsheet URL, usually in the form of https://docs.google.com/spreadsheets/d/{SPREADSHEET_ID}#gid={SHEET_ID}.
  • (Optional) Google spreadsheet sheet title (bottom of the spreadsheet UI). This is required if SHEET_ID is not specified in URL through #gid.
  • Google spreadsheet service account secret key.

Here are the steps to configure the service account:

  1. Activate Google Sheets API in the Google API Console.
  2. Create a service account: https://console.developers.google.com/apis/api/sheets.googleapis.com/credentials.
  3. Go into service account setting, click ADD KEY -> Create new key. Then select JSON format and save it somewhere safe as a file. gsheet-add-key
  4. Copy email address for your newly created service account, usually in the format of {ACCOUNT_NAME}@{PROJECT_ID}.iam.gserviceaccount.com. gsheet-service-account-email
  5. Open the Google spreadsheet that you want to serve, then share it with the newly created service account using the service account email. gsheet-share-service-account

Now you can configure ROAPI to load the google spreadsheet into a table using:

tables:
  - name: "table_name"
    uri: "https://docs.google.com/spreadsheets/d/1-lc4oij04aXzFSRMwVBLjU76s-K0-s6UPc2biOvtuuU#gid=0"
    option:
      format: "google_spreadsheet"
      application_secret_path: "path/to/service_account_key.json"
      # sheet_title is optional if `#gid` is specified in uri
      sheet_title: "sheet_name_within_google_spreadsheet"

ROAPI only invokes Google Sheets API during the initial data load, subsequent query requests are served using in memory data.

Example

Here is what it looks like to serve this public google spreadsheet with ROAPI.

Start server:

$ roapi-http -c local.yaml
[2021-03-01T00:08:03Z INFO  roapi_http::api] loading `https://docs.google.com/spreadsheets/d/1-lc4oij04aXzFSRMwVBLjU76s-K0-s6UPc2biOvtuuU#gid=0` as table `properties`
[2021-03-01T00:08:04Z INFO  roapi_http::api] registered `https://docs.google.com/spreadsheets/d/1-lc4oij04aXzFSRMwVBLjU76s-K0-s6UPc2biOvtuuU#gid=0` as table `properties`
[2021-03-01T00:08:04Z INFO  actix_server::builder] Starting 8 workers
[2021-03-01T00:08:04Z INFO  actix_server::builder] Starting "actix-web-service-127.0.0.1:8080" service on 127.0.0.1:8080

Query with aggregation using SQL frontend:

$ curl -s -X POST localhost:8080/api/sql --data-binary @- <<EOF | jq
SELECT DISTINCT(Landlord), COUNT(Address)
FROM properties
GROUP BY Landlord
EOF

[
  {
    "Landlord": "Roger",
    "COUNT(Address)": 3
  },
  {
    "Landlord": "Mike",
    "COUNT(Address)": 4
  },
  {
    "Landlord": "Daniel",
    "COUNT(Address)": 3
  },
  {
    "Landlord": "Sam",
    "COUNT(Address)": 2
  },
  {
    "Landlord": "Carl",
    "COUNT(Address)": 3
  }
]

Query with filter using GraphQL frontend:

$ curl -s -X POST localhost:8080/api/graphql --data-binary @- <<EOF | jq
query {
  properties(
    filter: {
      Bed: { gt: 3 }
      Bath: { gte: 3 }
    }
    sort: [
      { field: "Bed", order: "desc" }
    ]
  ) {
    Address
    Bed
    Bath
  }
}
EOF

[
  {
    "Address": "Kenmore, WA",
    "Bed": 4,
    "Bath": 3,
    "Monthly_Rent": "$4,000"
  },
  {
    "Address": "Fremont, WA",
    "Bed": 5,
    "Bath": 3,
    "Monthly_Rent": "$4,500"
  }
]

Note: when inferring table schema from Google spreadsheets, ROAPI automatically replaces spaces in column names with _(underscore).

NDJSON

NDJSON stands for Newline delimited JSON. It is a convenient format for storing or streaming structured data that may be processed one record at a time.

When a table uri ends in .ndjson, ROAPI will try to load it as NDJSON table if no format option is specified:

tables:
  - name: "mytable"
    uri: "http://mytable.ndjson"

Delta Lake

ROAPI supports loading Delta tables from Delta Lake through delta-rs. Since a Delta table doesn't have an extension suffix, ROAPI cannot infer table format from table URI alone. Therefore, the format option needs to be set to delta explicitly for Delta table sources:

tables:
  - name: "mytable"
    uri: "s3://bucket/delta_table/path"
    option:
      format: "delta"

Large Datasets

ROAPI loads the entire table into memory as the default behavior. If your table is large or you want to avoid loading all data during startup, you can set an additional option use_memory_table: false (default: true). With that configuration, ROAPI will not copy the data into memory, but instructs datafusion to directly operate on the backing storage.

At the moment, this comes with the following limitations:

  1. no nested schema: datafusion#83
  2. missing support for cloud storage: datafusion#616

Example:

tables:
  - name: "mytable"
    uri: "./path/to/delta_table"
    option:
      format: "delta"
      use_memory_table: false

Note that when providing use_memory_table option, it becomes necessary to also specify the format.

Arrow

When a table uri ends in .arrow or .arrows, ROAPI will try to load it as Arrow IPC file or stream if no format option is specified:

tables:
  - name: "mytable"
    uri: "http://mytable.arrow" # or .arrows

You can partition an Arrow dataset into multiple partitions and load all of them into a single table by directory path:

tables:
  - name: "mytable"
    uri: "./table_dir"
    option:
      format: "arrow" # or arrows

Blob store

ROAPI currently supports the following blob storages:

  • Filesystem
  • HTTP/HTTPS
  • S3

Filesystem

Filesystem store can be specified using file: or filesystem: schemes. In a Windows environment, the scheme is mandatory. On Unix systems, a uri without a scheme prefix is treated as filesystem backed data source by ROAPI.

For example, to serve a local parquet file test_data/blogs.parquet, you can just set the uri to the file path:

tables:
  - name: "blogs"
    uri: "test_data/blogs.parquet"

Filesystem store supports loading partitioned tables. In other words, you can split up the table into mulitple files and load all of them into a single table by setting uri to the directory path. When loading a partitioned dataset, you will need to manually specify table format since the uri will not contain table format as a suffix:

tables:
  - name: "blogs"
    uri: "test_data/blogs/"
    option:
        format: "parquet"

HTTP/HTTPS

ROAPI can build tables from datasets served through HTTP protocols. However, one thing to keep in mind is HTTP store doesn't support partitioned datasets because there is no native directory support in HTTP protocol.

S3

ROAPI can build tables from datasets hosted in S3 buckets. Configuration is similar to filesystem store:

tables:
  - name: "TABLE_NAME"
    uri: "s3://BUCKET/TABLE/KEY"
    option:
        format: "csv"

Note that AWS region needs to be manually specified through AWS_REGION environment variable.

To configure S3 credentials, you can use IAM role or set the following environment variables if you are using IAM user:

  • AWS_SECRET_ACCESS_KEY
  • AWS_ACCESS_KEY_ID

API

Query frontends

SQL

To query tables using a subset of standard SQL, send the query payload through POST request to /api/sql endpoint. This is the only query interface that supports table joins.

SQL query frontend is the more flexible and powerful compared to REST and GraphQL. It is the only query frontend that supports table joins.

SQL frontend supports [""] operator for accessing struct fields and array element. For example struct_col["key1"]["key2"] or array_col[0].

REST

Query tables through REST API by sending GET requests to /api/tables/{table_name}. Query operators are specified as query params.

REST query frontend currently supports the following query operators:

  • columns
  • sort
  • limit
  • filter

To sort column col1 in ascending order and col2 in descending order, set query param to: sort=col1,-col2.

To find all rows with col1 equal to string 'foo', set query param to: filter[col1]='foo'. You can also do basic comparisons with filters, for example predicate 0 <= col2 < 5 can be expressed as filter[col2]gte=0&filter[col2]lt=5.

GraphQL

To query tables through GraphQL, send the query through POST request to /api/graphql endpoint.

GraphQL query frontend supports the same set of operators supported by REST query frontend. Here how is you can apply various operators in a query:

{
    table_name(
        filter: {
            col1: false
            col2: { gteq: 4, lt: 1000 }
        }
        sort: [
            { field: "col2", order: "desc" }
            { field: "col3" }
        ]
        limit: 100
    ) {
        col1
        col2
        col3
    }
}

Schema

Get schemas for all tables:

curl localhost:8080/api/schema

Get schema for a specific table by name:

curl localhost:8080/api/schema/{TABLE_NAME}

Response serialization

By default, ROAPI encodes responses in JSON format, but you can request different encodings by specifying the ACCEPT header:

curl -X POST \
    -H 'ACCEPT: application/vnd.apache.arrow.stream' \
    -d "SELECT launch_library_id FROM spacex_launches WHERE launch_library_id IS NOT NULL" \
    localhost:8080/api/sql

ROAPI currently supports the following serialization formats:

  • application/json
  • application/csv
  • application/vnd.apache.arrow.file
  • application/vnd.apache.arrow.stream
  • application/parquet