
fix(txnames): Revert high threshold for running the clusterer (#49087)

As part of https://github.com/getsentry/team-ingest/issues/93, we merged
https://github.com/getsentry/sentry/pull/46503 to ensure we would not
run the clusterer for fresh projects until they had collected a large
number of unique transaction names. This was based on the suspicion that
we would otherwise declare all URL transactions as sanitized prematurely.

However, we did not have any data to back up this decision, and there is
no reason to impose this threshold from the algorithm's point of view:
There is already the (lower) `MERGE_THRESHOLD` which should prevent
low-quality replacement rules.

What we _do_ know is that we've seen a decline in the number of
transactions changed by clustering rules (see metric
`event.transaction_name_changes`), which might be because we are now too
strict about when we run the clusterer.
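For illustration, the change amounts to lowering the gate on when clustering runs. A minimal sketch of the before/after condition, with illustrative constant values (the real `MERGE_THRESHOLD` and `MAX_SET_SIZE` live in `sentry.ingest.transaction_clusterer` and may differ):

```python
# Illustrative values only -- assumptions, not the real Sentry constants.
MERGE_THRESHOLD = 200  # minimum names under a common prefix to emit a rule
MAX_SET_SIZE = 1000    # capacity of the per-project Redis name set

def should_cluster_old(tx_names: list) -> bool:
    # Old behavior: wait until the Redis set is completely full.
    return len(tx_names) >= MAX_SET_SIZE

def should_cluster_new(tx_names: list) -> bool:
    # New behavior: run as soon as a merge is even possible; the
    # (lower) MERGE_THRESHOLD still guards against low-quality rules.
    return len(tx_names) >= MERGE_THRESHOLD
```

With 200 collected names, the old condition would still skip the project while the new one runs the clusterer, which is the decline in `event.transaction_name_changes` this revert aims to address.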
Joris Bayer, 1 year ago
commit b24987f087

+ 1 - 1
src/sentry/ingest/transaction_clusterer/tasks.py

@@ -63,7 +63,7 @@ def cluster_projects(projects: Sequence[Project]) -> None:
                 span.set_data("project_id", project.id)
                 tx_names = list(redis.get_transaction_names(project))
                 new_rules = []
-                if len(tx_names) >= redis.MAX_SET_SIZE:
+                if len(tx_names) >= MERGE_THRESHOLD:
                     clusterer = TreeClusterer(merge_threshold=MERGE_THRESHOLD)
                     clusterer.add_input(tx_names)
                     new_rules = clusterer.get_rules()

+ 1 - 1
tests/sentry/ingest/test_transaction_clusterer.py

@@ -204,7 +204,7 @@ def test_run_clusterer_task(cluster_projects_delay, default_organization):
     project2 = Project(id=223, name="project2", organization_id=default_organization.id)
     for project in (project1, project2):
         project.save()
-        _add_mock_data(project, 10)
+        _add_mock_data(project, 4)
 
 
     spawn_clusterers()