
Great questions! Let me break this down:

Target audience:

1) Enterprise IT teams who already know SQL/YAML - they can build complex integrations after ~1 hour of training using our examples, no prior Python needed

2) Modern data teams using dbt - Sequor complements it perfectly for data ingestion and activation

What they gain:

Full flexibility with structure. Enterprise IT folks go from zero to building end-to-end solutions in an hour without needing developer support. Think "dbt but for API integrations."

Competitors & differentiation:

1) Zapier/n8n: GUI looks easy but gets complex fast, poor database integration, can't handle bulk data

2) Fivetran/Airbyte: Pre-built connectors only, zero customization, ingestion-only

3) Us: Only code-first solution using open tech stack (SQL+YAML+Python) - gives you flexibility with Fivetran reliability

Business model:

1) Core engine: Open source, free forever

2) Revenue: On-premise server with enterprise features (RBAC, observability and execution monitoring with notifications, audit logs) - flat fee per installation, no per-row costs like competitors

3) Services: Custom connector development and app-to-app integration flows (we love this work!)

4) Cloud version maybe later - everyone wants on-premise now

The key difference:

we're the only tool that's both easy to learn AND highly customizable for all major API integration patterns: data ingestion, reverse ETL, and multi-step iPaaS workflows - all in one platform.


Thank you for such an insightful suggestion and deep dive into the code - this is amazing feedback! I'll definitely switch to the ${{}} syntax you suggested.

Quick clarification on _expression: we intentionally use two templating systems - Jinja {{ }} for simple variable injection, and Python *_expression for complex logic that Jinja can't handle.

Actually, since we only use Jinja for variable substitution, should I just drop it entirely? We have another version implemented in Java/JavaScript that uses simple ${var-name} syntax, and we already have Python expressions for advanced scenarios. Might be cleaner to unify on ${var-name} + Python expressions.
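
To make the proposal concrete, here's a rough before/after sketch (the exact field names are illustrative, not final syntax):

  # current: two systems side by side
  url: "https://{{ var('store_name') }}.myshopify.com/admin/orders.json"   # Jinja substitution
  body_expression: "{'status': record['status']}"                          # Python expression

  # unified: ${var-name} substitution plus Python expressions
  url: "https://${store_name}.myshopify.com/admin/orders.json"
  body_expression: "{'status': record['status']}"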

Given how deeply you've looked into our system, would you consider using Sequor? I can promise full support including fundamental changes like these - your technical insight would be invaluable for getting the design right early on.


I'm not the target audience for this product, but I experience the pain from folks who embed jinja2/golang in yaml every single day, so I'm trying to do whatever I can to nip those problems in the bud, in the hope that one day it stops being the default pattern

As for "complex logic that jinja can't handle," I am not able to readily identify what that would mean given that jinja has executable blocks but I do agree with you that its mental model can make writing imperative code inside those blocks painful (e.g. {% set _ = my_dict.update({"something":"else}) %} type silliness)

it ultimately depends on whether those _expression: stanzas are always going to produce a Python result or whether they could produce arbitrary output. If the former, then I agree with you that jinja2 would be terrible for that, since it's a templating language[1]. If the latter, then using jinja2 would be a harmonizing choice, so the author didn't have to keep two different invocation styles in their head at once

1: one can see that in ansible via this convolution:

  body: >-
    {%- set foo = {} -%}
    {%- for i in ... -%}
    {%- endfor -%}
    {# now emit the dict as json #}
    {{ foo | to_json }}
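
For contrast, the pure-Python version of that dance would be a one-line expression (a sketch with the same elisions as above; transform is a made-up helper):

  body_expression: "json.dumps({i: transform(i) for i in ...})"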


Good catch! Yes, recalculating metrics across all historical data every run would be expensive in Snowflake. I chose this example for simplicity to show how the three operations work together, but you're absolutely right about the inefficiency. The flow can easily be optimized for incremental processing - pull only recent orders and update metrics for just the affected customers:

steps:

  # Step 1: Pull only NEW orders since last run

  - op: http_request
    request:
      source: "shopify"
      url: "https://{{ var('store_name') }}.myshopify.com/admin/api/{{ var('api_version') }}/orders.json"
      method: GET
      parameters:
        status: any
        updated_at_min_expression: "last_run_timestamp() or '2024-01-01'"
      headers:
        "Accept": "application/json"
    response:
      success_status: [200]
      tables:
        - source: "snowflake"
          table: "shopify_orders_incremental"
          columns: { ... }
          data_expression: response.json()['orders']
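
  # Step 1b: Fold the new/changed orders into the full history table so the
  # metrics query in Step 2 can see them. (Column names here are assumptions
  # based on the Shopify order payload - adjust to the actual `columns` mapping.)
  - op: transform
    source: "snowflake"
    query: |
      MERGE INTO shopify_orders o
      USING shopify_orders_incremental i
      ON o.id = i.id
      WHEN MATCHED THEN UPDATE SET
        customer_id = i.customer_id,
        total_price = i.total_price
      WHEN NOT MATCHED THEN
        INSERT (id, customer_id, total_price)
        VALUES (i.id, i.customer_id, i.total_price)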

  # Step 2: Update metrics ONLY for customers with new/changed orders
  - op: transform
    source: "snowflake"
    query: |
      MERGE INTO customer_metrics cm
      USING (
        SELECT 
          customer_id,
          SUM(total_price::FLOAT) as total_spend,
          COUNT(*) as order_count
        FROM shopify_orders 
        WHERE customer_id IN (
          SELECT DISTINCT customer_id 
          FROM shopify_orders_incremental
        )
        GROUP BY customer_id
      ) new_metrics
      ON cm.customer_id = new_metrics.customer_id
      WHEN MATCHED THEN 
        UPDATE SET 
          total_spend = new_metrics.total_spend,
          order_count = new_metrics.order_count,
          updated_at = CURRENT_TIMESTAMP()
      WHEN NOT MATCHED THEN
        INSERT (customer_id, total_spend, order_count, updated_at)
        VALUES (new_metrics.customer_id, new_metrics.total_spend, new_metrics.order_count, CURRENT_TIMESTAMP())

  # Step 3: Sync only customers whose metrics were just updated
  - op: http_request
    input:
      source: "snowflake"
      query: |
        SELECT customer_id, email, total_spend, order_count
        FROM customer_metrics 
        WHERE updated_at >= '{{ run_start_timestamp() }}'
    request:
      source: "mailchimp"
      url_expression: |
        f"https://us1.api.mailchimp.com/3.0/lists/{var('list_id')}/members/{hashlib.md5(record['email'].encode()).hexdigest()}"
      method: PATCH
      body_expression: |
        {
          "merge_fields": {
            "TOTALSPEND": record['total_spend'],
            "ORDERCOUNT": record['order_count']
          }
        }

This scales much better: if you have 100K customers but only 50 new orders, you're recalculating metrics for ~50 customers instead of all 100K. Same simple workflow pattern, just production-ready efficiency.

Does this address your concern or did you mean something else? Would you suggest I use a slightly more complex but optimized example for the main demo? Your feedback is welcome and appreciated!


I appreciate the response and detail. The code in your response definitely piqued my interest in the product more than the initial demo code does, but I do understand why you’d want simplicity on your homepage.


Dynamic YAML with computed properties could have applications beyond API integrations. We use Python since it's familiar to data engineers, but our original prototype with JavaScript had even more compact syntax. Would love feedback on our approach and other use cases for dynamic YAML.
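
The core convention, in a toy sketch (property names are illustrative): any key with an _expression suffix is evaluated as Python against the runtime context instead of being read as a literal:

  greeting: "Hello"                                # static property
  greeting_expression: "f'Hello, {user.name}'"     # computed at runtime
  retries_expression: "3 if env == 'prod' else 1"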


You're right about the abstraction concern. The vast majority of the workflow stays in structured YAML - Python is only needed for two specific points: constructing HTTP request bodies and parsing JSON responses. These are inherently dynamic operations that need real programming logic. The alternatives - proprietary DSLs or visual builders - would be far more complex to learn and maintain than a few lines of Python/JavaScript for JSON manipulation. We actually started with JavaScript since it's more natural for JSON work, but the difference was marginal. Do you have suggestions for a better approach to these two specific dynamic parts? We're open to ideas that would be simpler than Python but still handle the complexity generically.


I think you’re misunderstanding. This product doesn’t make sense because the problem itself is not solvable under your given constraints.


Just asking. Which problem? Which constraints?


> The alternatives - proprietary DSLs or visual builders - would be far more complex to learn and maintain

that's where I disagree. Your YAML DSL is far harder to learn and maintain. My code can be tested, iterated, understood by my IDE etc. It's just code.
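
e.g. a sketch of the same two API hops as ordinary Python (auth and the SQL step omitted) - testable with pytest, debuggable, and my IDE understands every line of it:

  import hashlib
  import requests

  def fetch_orders(store: str, api_version: str, since: str) -> list[dict]:
      """Pull new/changed Shopify orders since the given timestamp."""
      resp = requests.get(
          f"https://{store}.myshopify.com/admin/api/{api_version}/orders.json",
          params={"status": "any", "updated_at_min": since},
          headers={"Accept": "application/json"},
      )
      resp.raise_for_status()
      return resp.json()["orders"]

  def update_member(list_id: str, email: str, spend: float, count: int) -> None:
      """Patch a Mailchimp list member's merge fields."""
      member_hash = hashlib.md5(email.lower().encode()).hexdigest()
      resp = requests.patch(
          f"https://us1.api.mailchimp.com/3.0/lists/{list_id}/members/{member_hash}",
          json={"merge_fields": {"TOTALSPEND": spend, "ORDERCOUNT": count}},
      )
      resp.raise_for_status()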


Totally understand - you want clean, unified data for business insights, not another integration tool to maintain. Sequor actually grew out of our Master Data Management (MDM) work where data cleaning and deduplication are core challenges. We focused on API integration for this release, but have mature data cleaning/deduplication components that we plan to open source as well. What specific data quality issues are you dealing with? Happy to share what we've learned from MDM projects and our data quality engine.


That's exactly where I landed too. We didn't need a modern data stack managed by a data team, where everything is coded with significant turnaround times. We ended up using an MDM (Syncari).

MDMs are unsexy and have a lot of baggage filled with legacy, expensive vendors. But the principles are sound, and more modern platforms have turned out to be pretty good.


Great question! Rate limiting and concurrency are absolutely critical for production API integrations. Here's how Sequor handles them:

Rate limiting:

* Built-in rate-limit controls at the source level (requests per second/minute/hour): each http_request operation refers to an HTTP source defined separately

* Automatic backoff and retry logic with delays

* Per-endpoint rate-limit configuration is available, since different API calls can have different limits

* Because limits live at the source level, they work correctly even for parallel requests to the same source

The key idea is that rate limits are handled by the engine - the user never has to deal with them explicitly.

Concurrency, by contrast, is explicit and user-controlled:

* Inter-operation parallelism: wrap operations between begin_parallel_block and end_parallel_block - everything between the two is executed in parallel

* Intra-operation parallelism: many operations take parameters to partition their input data and process the partitions in parallel. For example, http_request takes an input table containing the data to be pushed via the API, and you can partition that table by key columns into a specified number of partitions. (Rough sketch of both below.)
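
A rough sketch of both (the partitioning parameter names are illustrative, not exact product syntax):

  steps:
    - op: begin_parallel_block
    - op: http_request        # these two requests run concurrently
      ...
    - op: http_request
      ...
    - op: end_parallel_block

    - op: http_request        # intra-operation parallelism
      input:
        source: "snowflake"
        query: "SELECT ... FROM customer_metrics"
        # illustrative: split the input by key into 8 concurrent partitions
        partition_by: [customer_id]
        partition_count: 8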

Thanks for the Steampipe reference! That's a really interesting approach - exposing APIs as Postgres tables is clever, and I'm definitely going to play with it.


Thank you for pointing this out! I wasn't aware of Arazzo before. It does look very similar indeed. I'll definitely look more into it and see how we might align with their format or adapt it to Sequor's capabilities and execution model. Really appreciate the heads up!


Fair point - 'intuitive' is subjective and depends on your background. Let me explain where each technology fits: HTTP request configuration (URL, headers, parameters) is pretty intuitive and similar to Postman, which most people already know. Python is used mainly for response parsing - without a general-purpose language there, it would be very limiting. Templating is mainly for variable substitution, familiar to anyone using frameworks like dbt. And SQL stays separate from Python - the two are never mixed together.


The Python here is doing exactly what it should - extracting nested data and flattening it into proper relational tables. Most of the 'Python' is actually just the return statement defining table schemas. The schema can be defined separately and reused across API calls (see the sketch below); with that reuse, the Python part becomes quite compact. This is the right balance between declarative config and the flexibility needed for real-world API responses. I can confirm it from my experience developing many integrations with Sequor - this approach handles real-world complexity much better than purely declarative tools.
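
A sketch of what that reuse could look like (the schemas section and columns_ref key are illustrative, not exact product syntax):

  # defined once, referenced by many http_request steps
  schemas:
    order_row: { id: INTEGER, customer_id: INTEGER, total_price: FLOAT }

  response:
    tables:
      - source: "snowflake"
        table: "shopify_orders"
        columns_ref: order_row                       # reuse the shared schema
        data_expression: response.json()['orders']   # the only Python left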

