
chore(seer grouping): Collect metric on HTML stacktraces (#74827)

We've been running into trouble in the Seer grouping backfill when confronted with stacktraces full of HTML files, which are token-dense and often meaningless. Here's a recent example of such a stacktrace:

```
"frames": [
    {
      "filename": "index.html?__geo_region=jp&loc=eyjrawqioiiydks4rjniyvrlwekwovb5yxdrrno4iiwiywxnijoirvmyntyifq.eyjzdwiioijhmvpdmevvnuuilcjhdwqioijndxj1z3vydsisimnvdw50cnkioijkucisimnyzwf0zwqioje3mtk4otu0mzisimlzcyi6imcxmjmtyxv0acisimn1cnjlbmn5ijoislbziiwizxhwijoxnzixmjy5nzuylcjyzwdpb24ioijkucisimxhbmcioijqysisimlhdci6mtcymta5njk1miwianrpijoicgnfnelbovpovel1cfrhsllncemyce9lwij9.wefd0fvomovr_gjrcquzatrsmstgrvzqew7uhuyiibajhas7m_hyceqkigikwyybvlsqxhdqrwywsrxqthmjeq&lang=jp&platform=jorp1&mode=0",
      "function": "t",
      "context_line": '<!DOCTYPE HTML><html><head><meta charset="utf-8"><title></title><meta name="viewport" content="width=device-width,initial-scale=1,minimum-sc {snip}'
    },
    # and about 50 more frames just like this
]
```

To help deal with these stacktraces, we're going to start stripping query strings from filenames[1] and treating context lines that contain only one `{snip}`, like the one above, as minified when truncating the stacktrace string[2]. We're also considering excluding HTML frames from the Seer stacktrace, but before we do that, it'd be nice to know how often we run into this problem.
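
As a rough illustration of the first of those changes, query-string stripping could look something like the sketch below. The helper name `strip_querystring` is hypothetical, and the linked PR[1] may well implement this differently.

```python
def strip_querystring(filename: str) -> str:
    """Return `filename` without anything from the first "?" onward.

    Hypothetical helper for illustration only; not taken from the Sentry codebase.
    """
    # str.partition returns the text before the separator as its first element,
    # or the whole string unchanged if the separator is absent.
    return filename.partition("?")[0]


print(strip_querystring("index.html?__geo_region=jp&lang=jp"))  # index.html
```

With the query string gone, a filename like the one in the example above collapses to `index.html`, which is also what lets a simple suffix check recognize it as an HTML file.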

This PR adds a temporary metric to track whether the stacktraces we send to Seer consist of all HTML frames, some HTML frames, or no HTML frames. Depending on how common HTML stacktraces are, we can decide how much effort we want to put into detecting and excluding HTML frames.
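
A minimal sketch of that three-way classification, pulled out of `get_stacktrace_string` for readability. The function names `is_html_frame` and `html_frames_tag` are hypothetical; the detection heuristic mirrors the one in this PR's diff and is deliberately not exhaustive.

```python
def is_html_frame(filename: str, context_line: str) -> bool:
    """Quick-and-dirty HTML check: an "html" filename suffix or an <html> tag in the context line."""
    return filename.endswith("html") or "<html>" in context_line


def html_frames_tag(html_frame_count: int, total_frame_count: int) -> str:
    """Map frame counts to the metric tag value: "none", "some", or "all"."""
    if html_frame_count == 0:
        return "none"
    if html_frame_count == total_frame_count:
        return "all"
    return "some"


# Illustrative (filename, context_line) pairs, not real event data.
frames = [
    ("index.html", "<!DOCTYPE HTML><html><head> {snip}"),
    ("app.js", "doSomething();"),
]
html_count = sum(is_html_frame(name, line) for name, line in frames)
print(html_frames_tag(html_count, len(frames)))  # some
```

Checking for zero first means a stacktrace with no frames at all is tagged `none` rather than `all`.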


[1] https://github.com/getsentry/sentry/pull/74825
[2] https://github.com/getsentry/sentry/pull/74826
Katie Byers 7 months ago
Parent
Commit
a6f413df65
1 changed file with 26 additions and 0 deletions
  1. src/sentry/seer/similarity/utils.py (+26 -0)

+ 26 - 0
src/sentry/seer/similarity/utils.py

@@ -37,6 +37,7 @@ def get_stacktrace_string(data: dict[str, Any]) -> str:
         exceptions = exceptions[0].get("values")
 
     frame_count = 0
+    html_frame_count = 0  # for a temporary metric
     stacktrace_str = ""
     found_non_snipped_context_line = False
     result_parts = []
@@ -74,6 +75,17 @@ def get_stacktrace_string(data: dict[str, Any]) -> str:
                     if not _is_snipped_context_line(frame_dict["context-line"]):
                         found_non_snipped_context_line = True
 
+                    # Not an exhaustive list of tests we could run to detect HTML, but this is only
+                    # meant to be a temporary, quick-and-dirty metric
+                    # TODO: Don't let this, and the metric below, hang around forever. It's only to
+                    # help us get a sense of whether it's worthwhile trying to more accurately
+                    # detect, and then exclude, frames containing HTML
+                    if (
+                        frame_dict["filename"].endswith("html")
+                        or "<html>" in frame_dict["context-line"]
+                    ):
+                        html_frame_count += 1
+
                     frame_strings.append(
                         f'  File "{frame_dict["filename"]}", function {frame_dict["function"]}\n    {frame_dict["context-line"]}\n'
                     )
@@ -97,6 +109,20 @@ def get_stacktrace_string(data: dict[str, Any]) -> str:
 
         stacktrace_str += header + "".join(frame_strings)
 
+    metrics.incr(
+        "seer.grouping.html_in_stacktrace",
+        sample_rate=1.0,
+        tags={
+            "html_frames": (
+                "none"
+                if html_frame_count == 0
+                else "all"
+                if html_frame_count == final_frame_count
+                else "some"
+            )
+        },
+    )
+
     return stacktrace_str.strip()