Denormalize Data #1

Closed
opened 2025-05-18 01:50:39 +00:00 by williamp · 1 comment
Owner

Was debugging some issues while implementing ScyllaDB support and received a warning that the query I was using (based off on the original Fake DB queries) was not efficient. GH Copilot offered a suggestion that seemed fair. Note that it is mentioning the users queries, this could be applied elsewhere as well.

The warning occurs because ScyllaDB (and Cassandra) requires that queries use indexed columns or primary key components for efficient lookups. When you query by a non-primary key column (like name in dbGetUserByName), ScyllaDB needs to scan the entire table, which can lead to unpredictable performance. By default, ScyllaDB blocks such queries unless you explicitly allow them using ALLOW FILTERING.


Options to Resolve the Warning

1. Add ALLOW FILTERING (Quick Fix)

You can append ALLOW FILTERING to the query to bypass the restriction. However, this is not recommended for production use because it can lead to performance issues on large datasets.

func dbGetUserByName(username string) (*User, error) {
	query := `SELECT id, name, password FROM users WHERE name = ? ALLOW FILTERING`
	var user User
	err := db.Session.Query(query, username).Scan(&user.ID, &user.Name, &user.Password)
	if err == gocql.ErrNotFound {
		return nil, errors.New("User not found")
	} else if err != nil {
		return nil, fmt.Errorf("failed to query user: %v", err)
	}
	return &user, nil
}

When to Use: Only use this approach if the dataset is small or if you are certain that the query will not cause performance issues.


2. Create a Secondary Index

If you frequently query by name, you can create a secondary index on the name column. This allows ScyllaDB to efficiently look up rows by name.

CQL Command to Create the Index:

CREATE INDEX ON users (name);

Updated Query:
Once the index is created, you can use the same query without ALLOW FILTERING:

func dbGetUserByName(username string) (*User, error) {
	query := `SELECT id, name, password FROM users WHERE name = ?`
	var user User
	err := db.Session.Query(query, username).Scan(&user.ID, &user.Name, &user.Password)
	if err == gocql.ErrNotFound {
		return nil, errors.New("User not found")
	} else if err != nil {
		return nil, fmt.Errorf("failed to query user: %v", err)
	}
	return &user, nil
}

When to Use: Use this approach if querying by name is a common operation and the dataset is large. Secondary indexes work well for low-cardinality columns (e.g., name).


3. Denormalize the Data

Another approach is to create a separate table specifically for querying by name. This is a common pattern in ScyllaDB and Cassandra, where data is denormalized to optimize for specific query patterns.

CQL Schema:

CREATE TABLE users_by_name (
    name TEXT PRIMARY KEY,
    id UUID,
    password TEXT
);

Insert Data into Both Tables:
When adding a user, insert the data into both the users table and the users_by_name table.

func dbAddUser(user *User) error {
	// Insert into the main table
	query1 := `INSERT INTO users (id, name, password) VALUES (?, ?, ?)`
	err := db.Session.Query(query1, user.ID, user.Name, user.Password).Exec()
	if err != nil {
		return fmt.Errorf("failed to add user to users table: %v", err)
	}

	// Insert into the denormalized table
	query2 := `INSERT INTO users_by_name (name, id, password) VALUES (?, ?, ?)`
	err = db.Session.Query(query2, user.Name, user.ID, user.Password).Exec()
	if err != nil {
		return fmt.Errorf("failed to add user to users_by_name table: %v", err)
	}

	return nil
}

Query by Name:
Now you can query the users_by_name table efficiently:

func dbGetUserByName(username string) (*User, error) {
	query := `SELECT id, name, password FROM users_by_name WHERE name = ?`
	var user User
	err := db.Session.Query(query, username).Scan(&user.ID, &user.Name, &user.Password)
	if err == gocql.ErrNotFound {
		return nil, errors.New("User not found")
	} else if err != nil {
		return nil, fmt.Errorf("failed to query user: %v", err)
	}
	return &user, nil
}

When to Use: Use this approach if querying by name is critical and you want to avoid the overhead of secondary indexes.


Recommendation

  • For Small Datasets: Use ALLOW FILTERING as a quick fix.
  • For Larger Datasets: Use a secondary index or denormalize the data, depending on your query patterns and performance requirements.

Denormalization is the most scalable and performant solution for large datasets, as it aligns with ScyllaDB's design principles.

Was debugging some issues while implementing ScyllaDB support and received a warning that the query I was using (based off on the original Fake DB queries) was not efficient. GH Copilot offered a suggestion that seemed fair. Note that it is mentioning the users queries, this could be applied elsewhere as well. The warning occurs because ScyllaDB (and Cassandra) requires that queries use indexed columns or primary key components for efficient lookups. When you query by a non-primary key column (like `name` in `dbGetUserByName`), ScyllaDB needs to scan the entire table, which can lead to unpredictable performance. By default, ScyllaDB blocks such queries unless you explicitly allow them using `ALLOW FILTERING`. --- ### Options to Resolve the Warning #### 1. **Add `ALLOW FILTERING` (Quick Fix)** You can append `ALLOW FILTERING` to the query to bypass the restriction. However, this is not recommended for production use because it can lead to performance issues on large datasets. ```go func dbGetUserByName(username string) (*User, error) { query := `SELECT id, name, password FROM users WHERE name = ? ALLOW FILTERING` var user User err := db.Session.Query(query, username).Scan(&user.ID, &user.Name, &user.Password) if err == gocql.ErrNotFound { return nil, errors.New("User not found") } else if err != nil { return nil, fmt.Errorf("failed to query user: %v", err) } return &user, nil } ``` **When to Use**: Only use this approach if the dataset is small or if you are certain that the query will not cause performance issues. --- #### 2. **Create a Secondary Index** If you frequently query by `name`, you can create a secondary index on the `name` column. This allows ScyllaDB to efficiently look up rows by `name`. **CQL Command to Create the Index**: ```cql CREATE INDEX ON users (name); ``` **Updated Query**: Once the index is created, you can use the same query without `ALLOW FILTERING`: ```go func dbGetUserByName(username string) (*User, error) { query := `SELECT id, name, password FROM users WHERE name = ?` var user User err := db.Session.Query(query, username).Scan(&user.ID, &user.Name, &user.Password) if err == gocql.ErrNotFound { return nil, errors.New("User not found") } else if err != nil { return nil, fmt.Errorf("failed to query user: %v", err) } return &user, nil } ``` **When to Use**: Use this approach if querying by `name` is a common operation and the dataset is large. Secondary indexes work well for low-cardinality columns (e.g., `name`). --- #### 3. **Denormalize the Data** Another approach is to create a separate table specifically for querying by `name`. This is a common pattern in ScyllaDB and Cassandra, where data is denormalized to optimize for specific query patterns. **CQL Schema**: ```cql CREATE TABLE users_by_name ( name TEXT PRIMARY KEY, id UUID, password TEXT ); ``` **Insert Data into Both Tables**: When adding a user, insert the data into both the `users` table and the `users_by_name` table. ```go func dbAddUser(user *User) error { // Insert into the main table query1 := `INSERT INTO users (id, name, password) VALUES (?, ?, ?)` err := db.Session.Query(query1, user.ID, user.Name, user.Password).Exec() if err != nil { return fmt.Errorf("failed to add user to users table: %v", err) } // Insert into the denormalized table query2 := `INSERT INTO users_by_name (name, id, password) VALUES (?, ?, ?)` err = db.Session.Query(query2, user.Name, user.ID, user.Password).Exec() if err != nil { return fmt.Errorf("failed to add user to users_by_name table: %v", err) } return nil } ``` **Query by Name**: Now you can query the `users_by_name` table efficiently: ```go func dbGetUserByName(username string) (*User, error) { query := `SELECT id, name, password FROM users_by_name WHERE name = ?` var user User err := db.Session.Query(query, username).Scan(&user.ID, &user.Name, &user.Password) if err == gocql.ErrNotFound { return nil, errors.New("User not found") } else if err != nil { return nil, fmt.Errorf("failed to query user: %v", err) } return &user, nil } ``` **When to Use**: Use this approach if querying by `name` is critical and you want to avoid the overhead of secondary indexes. --- ### Recommendation - **For Small Datasets**: Use `ALLOW FILTERING` as a quick fix. - **For Larger Datasets**: Use a secondary index or denormalize the data, depending on your query patterns and performance requirements. Denormalization is the most scalable and performant solution for large datasets, as it aligns with ScyllaDB's design principles.
williamp self-assigned this 2025-05-18 01:50:51 +00:00
Author
Owner

Fixed in latest commit, ended up just doing a secondary index

Fixed in latest commit, ended up just doing a secondary index
Sign in to join this conversation.
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: williamp/chatservice_concept#1
No description provided.