Sources Configuration

Understanding sources (aka connectors)

Sources (connectors) and Collections (streams)

Sources (or connectors) import data from external APIs (Google Analytics, Facebook, etc.) or databases (Redis, Firebase, etc.) into destinations. Each source represents a connection to a particular API or database.

The synchronization scheduling engine is called sync tasks.

Jitsu supports three types of sources:

Native sources (for example, Google Ads and Facebook). These sources are written in Go and are part of the Jitsu codebase.
Singer-based sources. Singer is a collection of ETL connectors written in Python (called 'taps'). Singer-based sources are not part of the Jitsu codebase. Jitsu just runs the Python package, processes its output and saves the data to a destination. Learn more about Singer-based sources configuration.
Airbyte-based sources. Airbyte is an ETL framework similar to Singer. Airbyte sources are distributed as Docker images. Jitsu pulls those images, runs them and writes their output to a destination. Learn more about Airbyte-based sources configuration.

Collections (aka streams)

Each source exports one or more collections (also called "streams" in Airbyte/Singer nomenclature). For example, the Slack source exports Users, Messages, Channels and a few other collections. Each collection is represented by a table in a destination.

Collections may be static or configurable. The configuration usually defines the set of fields to export. For example, Firebase collections (users, firestore) are static, while Google Analytics collections are parametrized (Google Analytics has dimensions and metrics).

Native Connector Configuration

This section applies only to connectors that are a native part of Jitsu. The full list of native connectors is: facebook, google-ads, google-analytics, redis, google-play, firebase, amplitude.

Other connectors (based on either Singer or Airbyte) have a slightly different configuration syntax. Learn more about Singer-based or Airbyte-based sources.

Example of source configuration:

sources:
  firebase_example_id:
    type: firebase
    destinations:
      - "<DESTINATION_ID>"
    collections:
      - "<FIRESTORE_COLLECTION_ID>"
    config:
      project_id: "<FIREBASE_PROJECT_ID>"
      key: '<GOOGLE_SERVICE_ACCOUNT_KEY_JSON>'
  google_analytics_example_id:
    type: google_analytics
    destinations:
      - "<DESTINATION_ID>"
    collections:
      - name: "report_test"
        type: "report"
        schedule: '45 23 * * 6'
        parameters:
          dimensions:
            - "ga:country"
            - "ga:yearMonth"
          metrics:
            - "ga:sessions"
    config:
      view_id: "<VIEW_ID_VALUE>"
      auth:
        service_account_key: "<GOOGLE_SERVICE_ACCOUNT_KEY_JSON>"
  ...

Common YAML properties for all sources (all of them are required):

Property	Description
`type`	Type of the data source to import data from (such as `google_analytics` or `firebase`)
`destinations`	List of destination IDs where the imported data is stored
`collections`	List of collections to synchronize
`config`	Custom parameters specific to each source type

To see how to configure a particular source type, see the documentation page for that source type.

This feature requires:

meta.storage configuration

primary_key_fields configuration (for Postgres destinations)

Collection Configuration

Sources should define a list of collections (or streams) explicitly. Each collection defines a synchronization schedule and a destination table name (the table name is prefixed with the source_id to avoid collisions). Here's an example configuration snippet:

sources:
  firebase_example_id:
    collections:
      - name: "some_name"
        type: "collection_type_id"
        table_name: "table_name_for_data"
        start_date: "2020-06-01"
        schedule: '@daily' # cron expression, see below
        parameters:
          field1: "value"
          field2: ["values"]
          field3:
            some_object:
        ...

Full list of parameters

Parameter	Description
`name` (required)	Unique identifier of the collection within the list of collections
`type`	Determines which data subset is synchronized. If omitted, defaults to the value of `name`
`table_name`	Name of the table that holds the synchronized data. If not set, defaults to the name of the collection
`start_date`	Start date of the data to download, as a string in `YYYY-MM-DD` format. Defaults to `365` days ago
`schedule`	Cron expression that defines the automatic synchronization schedule for the collection. If not set, the collection can only be synchronized manually (via the HTTP API)
`parameters`	If the collection is parametrized, parameter values are set here. A value may be of any type (`string`, `number`, `boolean`, `list`, `object`). For the full list of parameters, see the catalog
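
As a rough illustration of the `schedule` parameter, the sketch below shows three collections with different schedules. The collection names are hypothetical, the other source properties are omitted for brevity, and the comments read the expressions as standard five-field cron (minute, hour, day of month, month, day of week); which forms are accepted should be checked against the Jitsu version in use.

```yaml
sources:
  firebase_example_id:
    # type, destinations and config omitted for brevity
    collections:
      - name: "weekly_collection"      # hypothetical name
        type: "collection_type_id"
        schedule: '45 23 * * 6'        # 23:45 every Saturday
      - name: "daily_collection"       # hypothetical name
        type: "collection_type_id"
        schedule: '@daily'             # once a day
      - name: "manual_collection"      # hypothetical name
        type: "collection_type_id"
        # no schedule: synchronized only manually via the HTTP API
```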

If the collection has no parameters, it can be configured by its name alone, given as a string. For example:

collections: ["collection1_id", "collection2_id"]

Configuring sources via HTTP endpoint

If the sources configuration is generated by an external service, it can be externalized to an HTTP endpoint (or a file) as follows:

sources: 'location'

The location can be an http(s):// URL or a local file path (/path/to/file), and it should contain YAML (or JSON with the same structure as the YAML). If the location is a URL, the client will respect If-Modified-Since / Last-Modified caching.
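
A minimal sketch of the two forms of location described above. Both the URL and the file path are hypothetical placeholders, and only one `sources` key should be present in a real configuration.

```yaml
# Option 1: load sources from an HTTP(S) endpoint (hypothetical URL)
sources: 'https://config.example.com/jitsu/sources.json'

# Option 2: load sources from a local file (hypothetical path)
# sources: '/etc/jitsu/sources.yaml'
```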

Example of URL content:

{
  "sources": { # JSON object whose keys are unique source ids
    "facebook_marketing_online_sales": { # source config object
      "type": "facebook_marketing",
      ...
    },
    "facebook_marketing_offline_sales": {
      "type": "facebook_marketing",
      ...
    }
  }
}
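
Since the location may also contain YAML, the same content could be written as the following YAML sketch. It mirrors the JSON example key for key, and the elided source settings are left elided.

```yaml
sources:                              # keys are unique source ids
  facebook_marketing_online_sales:    # source config object
    type: facebook_marketing
    # ... remaining source settings elided, as in the JSON example
  facebook_marketing_offline_sales:
    type: facebook_marketing
    # ... remaining source settings elided, as in the JSON example
```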
