Subsetting with Neosync
Neosync offers a powerful subsetting feature that allows users to efficiently manage and extract relevant data from complex database schemas. This feature is particularly useful for maintaining data integrity and optimizing data processing tasks.
Key Features of Subsetting
1. Foreign Key-Based Subsetting
Neosync leverages foreign key constraints to automatically subset related tables. When a subset condition (such as a WHERE clause) is specified on a table, Neosync can propagate this condition to related tables through foreign key relationships. This ensures that only the relevant data is included in your subset, maintaining referential integrity across your data. Neosync processes the subsetting directionally from top down to ensure data integrity.
- Example: If you have a
users
table with a conditionid < 100
and anorders
table with a foreign key tousers
, Neosync will ensure that only orders associated with the subset of users are included.
2. Enabling Subsetting by Foreign Key
This feature can be toggled on or off when creating your job. When enabled, Neosync uses foreign key relationships to create appropriate joins between tables, ensuring that your subset maintains referential integrity throughout the database. If this feature is not enabled, the WHERE clause is only applied to the table it is set on, and related tables will not be subsetted based on foreign key constraints.
- Multiple Subset Queries: You can add multiple subset queries to different tables, and Neosync will handle the relationships between them appropriately.
3. Handling Complex Relationships
Neosync is adept at managing self-referencing tables and circular dependencies. It can handle complex database schemas, provided there is at least one nullable column within the circular dependency cycle to serve as a viable entry point.
- Circular Dependencies: Neosync can identify and manage circular dependencies, ensuring that data is processed in the correct order to maintain integrity.
4. Query Building and Execution
Neosync builds SQL SELECT queries for each table based on the specified subset conditions. It supports various database drivers, including PostgreSQL, MySQL, and SQL Server, and can generate queries that respect foreign key constraints. Users can view the generated SQL SELECT queries on the job run page.
Conclusion
Neosync's subsetting feature simplifies the process of creating data subsets from complex databases. By leveraging foreign key constraints and handling complex relationships, Neosync ensures that all related data is cohesively maintained while preserving the relationships between your tables.
Examples Diagram
Examples: Subsetting by Foreign Key Enabled
Subsetting with a WHERE Clause on the addresses
Table
Consider the diagram above where we have three tables: customers
, addresses
, and orders
. The addresses
table has a foreign key relationship with the customers
table, and the orders
table has a foreign key relationship with the addresses
table.
If a WHERE clause is applied to the addresses
table, Neosync will propagate this condition to the related tables as follows:
Addresses Table: The WHERE clause is directly applied to the addresses
table. For example, if the condition is city = 'New York'
, only the addresses in New York will be included in the subset.
SELECT
*
FROM
addresses
WHERE
city = 'New York';
Customers Table: Neosync will identify the foreign key relationship between the addresses
and customers
tables. It will then include only the customers who have addresses that meet the WHERE clause condition. This ensures that the subset of customers is relevant to the subset of addresses.
SELECT
*
FROM
customers
JOIN addresses ON customers.address_id = addresses.id
WHERE
addresses.city = 'New York';
Orders Table: Neosync will identify the foreign key relationships between the orders
, addresses
, and customers
tables. It will join the orders
table with both the addresses
and customers
tables to ensure that only the orders associated with the addresses meeting the WHERE clause condition and their corresponding customers are included. This ensures that the subset of orders is relevant to the subset of addresses and customers.
SELECT
*
FROM
orders
JOIN customers ON orders.customer_id = customers.id
JOIN addresses ON orders.address_id = addresses.id
WHERE
addresses.city = 'New York';
Payments Table: Neosync will identify the foreign key relationships between the payments
, orders
, addresses
, and customers
tables. It will join the payments
table with both the orders
, addresses
, and customers
tables to ensure that only the payments associated with the orders meeting the WHERE clause condition and their corresponding addresses and customers are included. This ensures that the subset of payments is relevant to the subset of addresses and customers.
SELECT
*
FROM
payments
JOIN orders ON payments.order_id = orders.id
JOIN customers ON orders.customer_id = customers.id
JOIN addresses ON orders.address_id = addresses.id
WHERE
addresses.city = 'New York';
Subsetting with a WHERE Clause on the customers
Table
If a WHERE clause is applied to the customers
table, Neosync will propagate this condition to the related tables as follows:
Addresses Table: No subset is applied to the addresses
table because Neosync takes a top-down approach.
Customers Table: Neosync will apply the WHERE clause directly to the customers
table. For example, if the condition is name = 'Jane'
, only the customers named Jane will be included in the subset.
SELECT
*
FROM
customers
WHERE
customers.name = 'Jane';
Orders Table: Neosync will identify the foreign key relationships between the orders
and customers
tables. It will join the orders
table with the customers
table to ensure that only the orders associated with the customers meeting the WHERE clause condition are included. This ensures that the subset of orders is relevant to the subset of customers.
SELECT
*
FROM
orders
JOIN customers ON orders.customer_id = customers.id
WHERE
customers.name = 'Jane';
Payments Table: Neosync will identify the foreign key relationships between the payments
, orders
, and customers
tables. It will join the payments
table with both the orders
and customers
tables to ensure that only the payments associated with the orders meeting the WHERE clause condition and their corresponding customers are included. This ensures that the subset of payments is relevant to the subset of customers.
SELECT
*
FROM
payments
JOIN orders ON payments.order_id = orders.id
JOIN customers ON orders.customer_id = customers.id
WHERE
customers.name = 'Jane';
Subsetting with a WHERE Clause on the orders
Table
If a WHERE clause is applied to the orders
table, Neosync will propagate this condition to the related tables as follows:
Addresses Table: No subset is applied to the addresses
table because Neosync takes a top-down approach.
Customers Table: No subset is applied to the customers
table because Neosync takes a top-down approach.
Payments Table: No subset is applied to the payments
table because Neosync takes a top-down approach.
Orders Table: Neosync will apply the WHERE clause directly to the orders
table. For example, if the condition is total_amount > 100
, only orders with a total amount greater than 100 will be included in the subset.
Subsetting with WHERE Clauses on Both addresses
and customers
Tables
Suppose you want to subset data such that you only include addresses in New York and customers named Jane. Neosync will handle these conditions as follows:
Addresses Table: The WHERE clause is directly applied to the addresses
table to include only addresses in New York.
SELECT
*
FROM
addresses
WHERE
city = 'New York';
Customers Table: The WHERE clause is directly applied to the customers
table to include only customers named Jane.
SELECT
*
FROM
customers
JOIN addresses ON customers.address_id = addresses.id
WHERE
addresses.city = 'New York'
AND customers.name = 'Jane';
Orders Table: Neosync will join the orders
table with both the addresses
and customers
tables. It will ensure that only orders associated with the addresses in New York and customers named Jane are included.
SELECT
*
FROM
orders
JOIN customers ON orders.customer_id = customers.id
JOIN addresses ON orders.address_id = addresses.id
WHERE
addresses.city = 'New York'
AND customers.name = 'Jane';
Payments Table: Neosync will join the payments
table with the orders
, addresses
, and customers
tables. It will ensure that only payments associated with the orders meeting both the WHERE clause conditions on addresses and customers are included.
SELECT
*
FROM
payments
JOIN orders ON payments.order_id = orders.id
JOIN customers ON orders.customer_id = customers.id
JOIN addresses ON orders.address_id = addresses.id
WHERE
addresses.city = 'New York'
AND customers.name = 'Jane';
In this example, Neosync effectively combines the subset conditions on both the addresses
and customers
tables to ensure that all related data in the orders
and payments
tables is relevant to the specified conditions.
Examples: Subsetting by Foreign Key Disabled
Subsetting with WHERE Clauses on Both addresses
and customers
Tables
Neosync will handle these conditions as follows with subsetting by foreign key disabled:
Addresses Table: The WHERE clause is directly applied to the addresses
table to include only addresses in New York.
SELECT * FROM addresses WHERE city = 'New York';
Customers Table: The WHERE clause is directly applied to the customers
table to include only customers named Jane.
SELECT * FROM customers WHERE customers.name = 'Jane';
Orders Table: Since foreign key subsetting is disabled, no joins or filters will be applied to the orders table. The orders table will be selected in its entirety without any filtering conditions.
SELECT * FROM orders;
Payments Table: Since foreign key subsetting is disabled, no joins or filters will be applied to the payments table. The payments table will be selected in its entirety without any filtering conditions.
SELECT * FROM payments;
In this example, since foreign key subsetting is disabled, Neosync applies the WHERE clauses independently to each table. The addresses
and customers
tables get their respective conditions, while the orders
and payments
tables remain unfiltered since no direct WHERE clauses were specified for them. This means all orders and payments will be included, regardless of their relationships to the filtered addresses and customers. Unless you are certain this won't cause foreign key violations in your target database, it's recommended to enable the "Skip Foreign Key Violations" option for jobs where foreign key subsetting is disabled.