KNOWLEDGE BASE

Useful treasure trove of knowledge

Jiakan Wang (Statsig)
Monday, July 18, 2022, 10:17 PM
They aren’t being logged right now. This is definitely an area we can improve on, but it hasn’t been a high priority for us yet. As you mentioned, this will only happen if the request itself was bad, which was not the case here.
Jacob Hurwitz
Monday, July 18, 2022, 9:58 PM
I think that would only help if the request fails (and not if the timer/timeout wins the race), but could still be a helpful log item to look for.
Jacob Hurwitz
Monday, July 18, 2022, 9:58 PM
Looking through the Statsig source now. I see where this is happening, and the parameters that lead to the retries/backoffs. By chance, does this error get logged anywhere? (E.g., `console.error`, or maybe the logs that eventually get sent back to Statsig?) https://github.com/statsig-io/js-client/blob/c2c660262da46c8a5ddf556770ffdce6a8b86044/src/StatsigNetwork.ts#L88-L92
Jiakan Wang (Statsig)
Monday, July 18, 2022, 9:37 PM
Oh, we do retry 3 times, after 1, 2, and 4 seconds. If the init timeout happens first, though, the promise will get resolved before the retries have fully finished.
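The retry-versus-timeout race described here can be sketched roughly as follows. This is a minimal illustration, not the Statsig SDK source: `backoffDelays`, `fetchWithRetries`, and `initializeWithTimeout` are hypothetical names, and the only values taken from the thread are the three retries after 1, 2, and 4 seconds.

```typescript
// Hypothetical helpers for illustration; not the Statsig SDK implementation.

// Delays before each retry: base, 2x base, 4x base, ...
function backoffDelays(retries: number, baseMs: number): number[] {
  return Array.from({ length: retries }, (_, i) => baseMs * 2 ** i);
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Try once, then retry after each backoff delay until an attempt succeeds.
async function fetchWithRetries<T>(
  attempt: () => Promise<T>,
  delaysMs: number[],
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i <= delaysMs.length; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
      if (i < delaysMs.length) await sleep(delaysMs[i]);
    }
  }
  throw lastError;
}

type InitResult<T> =
  | { source: "network"; value: T }
  | { source: "timeout" }
  | { source: "failed"; error: unknown };

// Race the retrying request against the init timeout. If the timer wins,
// the caller proceeds with cached or default values while the retries
// keep running in the background.
function initializeWithTimeout<T>(
  attempt: () => Promise<T>,
  initTimeoutMs: number,
  delaysMs: number[] = backoffDelays(3, 1000),
): Promise<InitResult<T>> {
  const network = fetchWithRetries(attempt, delaysMs).then(
    (value): InitResult<T> => ({ source: "network", value }),
    (error): InitResult<T> => ({ source: "failed", error }),
  );
  const timer = sleep(initTimeoutMs).then(
    (): InitResult<T> => ({ source: "timeout" }),
  );
  return Promise.race([network, timer]);
}
```

Under these assumptions, a failing first attempt cannot exhaust all retries before a 3000 ms timeout, because the backoff delays alone add up to 7 seconds, so in that case the timer wins the race.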
Jacob Hurwitz
Monday, July 18, 2022, 9:31 PM
https://statsigcommunity.slack.com/archives/C01QVL20EDD/p1658179416096449?thread_ts=1657921888.499019&cid=C01QVL20EDD If this happens, what does the JS/React SDK do? The documentation is very clear about the initial behavior: “If the SDK resolves before the network request has completed due to the timeout, it will continue work with local overrides, cached values, and then default values set in code.” But does the SDK retry the `initialize` call? If it retries, what’s the logic for when and how often it retries?
Jacob Hurwitz
Monday, July 18, 2022, 9:29 PM
Two of those users work for the same company. Potentially another point in favor of the VPN theory. I’ll make sure to ask questions about their VPN setup, if any.
Jacob Hurwitz
Monday, July 18, 2022, 9:27 PM
Thanks for the user IDs, and for the max data. Interesting. So yes, it seems like either our telemetry is wrong in recording ~5min latency (it’s a third party package and I’m not sure how reliable it is — still hoping I can collect more data from the user who reported this, but haven’t heard back from them yet), or there’s something else causing the delay elsewhere in the network (ISP and VPN are good theories, and could also explain why some users more consistently experience issues).
Jiakan Wang (Statsig)
Monday, July 18, 2022, 9:26 PM
then run experiments and feature gates using this device id
Jiakan Wang (Statsig)
Monday, July 18, 2022, 9:26 PM
We don’t do session level aggregation for analytics purposes if that’s what you are asking. If you are able to use cookies, I would actually store a device ID in the cookie and use that both for your server and client side.
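One way to picture the cookie-backed device ID suggested here is the sketch below. It is an illustration only: the cookie name `device_id`, the one-year lifetime, and the helper names are assumptions, and the generated IDs are assumed to be URL-safe (for example UUIDs), so no encoding is applied.

```typescript
// Hypothetical cookie name and lifetime, chosen for illustration only.
const DEVICE_ID_COOKIE = "device_id";
const ONE_YEAR_SECONDS = 60 * 60 * 24 * 365;

// Parse a Cookie header ("a=1; b=2") into a name -> value map.
function parseCookies(header: string): Map<string, string> {
  const cookies = new Map<string, string>();
  for (const part of header.split(";")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue; // skip fragments without a value
    const name = part.slice(0, eq).trim();
    const value = part.slice(eq + 1).trim();
    if (name) cookies.set(name, value);
  }
  return cookies;
}

// Reuse the existing device ID if the cookie is present, otherwise mint a
// new one. A Set-Cookie value is always returned so the expiry is refreshed
// on every visit and the ID stays alive as long as possible.
function getOrCreateDeviceId(
  cookieHeader: string,
  generateId: () => string,
): { deviceId: string; setCookie: string } {
  const existing = parseCookies(cookieHeader).get(DEVICE_ID_COOKIE);
  const deviceId = existing ?? generateId();
  const setCookie =
    `${DEVICE_ID_COOKIE}=${deviceId}; Max-Age=${ONE_YEAR_SECONDS}; ` +
    "Path=/; SameSite=Lax";
  return { deviceId, setCookie };
}
```

The same `deviceId` value would then be handed to both the server-side and the client-side SDK as the user's identifier, so guest and logged-in traffic from one browser map to the same unit.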
Jiakan Wang (Statsig)
Monday, July 18, 2022, 9:24 PM
6404567 10669911 80856
Jacob Hurwitz
Monday, July 18, 2022, 9:24 PM
Sorry, I missed that as I was collecting data. Let me take a look.
Jiakan Wang (Statsig)
Monday, July 18, 2022, 9:24 PM
Did you look at the MAX request time graph I sent above?
Jiakan Wang (Statsig)
Monday, July 18, 2022, 9:23 PM
There shouldn’t be any case where the request takes more than 20 seconds, actually. Anything taking more than that will be terminated and return 5XX, and this is sitting on the edge of our server. So my theory is that it’s some throttling behavior by the user’s ISP or VPN service.
Jacob Hurwitz
Monday, July 18, 2022, 9:23 PM
And again, since you calculate your p50 and p95 latency, it seems like you have latency logged on your end. Do you log the latency of all requests, or do you sample? Assuming you do not sample, is there a way to match up the latency we’ve recorded with what you’ve recorded?
Maggie (Statsig)
Monday, July 18, 2022, 9:23 PM
Kevin Ye I’m working on adding support for funnels based on custom metrics. Do you have an example of a funnel you want to create? I could get it set up for you in the backend as a test.
Jacob Hurwitz
Monday, July 18, 2022, 9:22 PM
Are you able to send me user IDs for the 3 users who could not successfully initialize? Our telemetry is client-side and sometimes blocked by content blockers, but I can see if we have similar logs/data for the other two. https://statsigcommunity.slack.com/archives/C01QVL20EDD/p1658167456291939?thread_ts=1657921888.499019&cid=C01QVL20EDD
Jacob Hurwitz
Monday, July 18, 2022, 9:20 PM
This is what Common Room’s telemetry shows happened to the user https://console.statsig.com/1RGPypreBznu0OcFtjp2pX/users/user_id/10669911 on 7/15, for their API calls to Statsig (all times in PT time zone):
Calls to `rgstr`:
• 12:44:39 56.2ms
• 12:45:08 50.0ms
• 12:48:52 91.5ms
• 12:51:39 62.5ms
• 12:51:48 97.6ms
• 12:52:38 125.5ms
• 12:52:48 47.7ms
• 12:53:08 98.5ms
• 13:04:03 156ms
• 13:04:13 56.3ms
Calls to `initialize`:
• 12:51:33 27.5ms
• 12:52:36 282800ms
• 12:52:36 278800ms
• 13:03:56 34.7ms
• 13:03:56 326400ms
• 13:09:23 271900ms
To me, this is not consistent with the “flaky wifi” explanation. _Only_ the requests to `initialize` are slow, and they seem to be repeatably slow. To me, this is far more likely to be consistent with an issue where the Statsig server has high latency and is failing to respond in a timely fashion.
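The comparison between the two endpoints can be made concrete with a short summary over the latencies listed in this message. The sketch below is only an illustration: `summarize` and `LatencySummary` are hypothetical names, and the 3000 ms threshold is the default init timeout mentioned elsewhere in the thread.

```typescript
// Latencies in milliseconds, copied from the telemetry in the message above.
const rgstrMs = [56.2, 50.0, 91.5, 62.5, 97.6, 125.5, 47.7, 98.5, 156, 56.3];
const initializeMs = [27.5, 282800, 278800, 34.7, 326400, 271900];

interface LatencySummary {
  count: number;
  medianMs: number;
  maxMs: number;
  overTimeout: number; // number of calls slower than the timeout
}

// Summarize a non-empty list of latencies against a timeout threshold.
function summarize(latenciesMs: number[], timeoutMs: number): LatencySummary {
  if (latenciesMs.length === 0) {
    throw new Error("need at least one latency sample");
  }
  const sorted = latenciesMs.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const medianMs =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  return {
    count: sorted.length,
    medianMs,
    maxMs: sorted[sorted.length - 1],
    overTimeout: sorted.filter((ms) => ms > timeoutMs).length,
  };
}

console.log("rgstr", summarize(rgstrMs, 3000));
console.log("initialize", summarize(initializeMs, 3000));
```

With these numbers, none of the 10 `rgstr` calls exceed 3000 ms, while 4 of the 6 `initialize` calls do, which is the asymmetry the message points out.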
Sam Hibberd
Monday, July 18, 2022, 9:16 PM
Ah ok, and in that case how do you define a session, if they browse today and then pick it up again in a week?
Jacob Hurwitz
Monday, July 18, 2022, 9:15 PM
But the `initialize` endpoint should get called at the same time they first load our app. If the user’s wifi is down, presumably our app (Common Room) also won’t load. It again is peculiar to me that in the one case I dug into upthread (and I’m not saying all the cases are like this, but at least the one I dug into) the `initialize` endpoint is _consistently_ slow (it took ~5 min for that user multiple different times — I can pull the data again), while the `rgstr` endpoint and Common Room’s app endpoints were all fast. If that pattern is repeating across multiple users, that suggests potentially an issue on Statsig’s end, and not just “flaky wifi.”
Jiakan Wang (Statsig)
Monday, July 18, 2022, 8:57 PM
tbh 3% doesn’t sound that bad considering how often my wifi has problems :sweat_smile:. We literally have a “days since last wifi down” counter in the office and it gets reset all the time lol. We don’t have p99/99.9 today unfortunately, but our server 5XX errors happened to ~0.01% of requests in the past week, so even at 99.9% I’m pretty sure it would’ve been returned by our server just fine. We also have MAX request time. Looking at the past week (each datapoint here is 1 hour and represents the request that took the longest within that window, out of tens of millions of requests), the max does exceed the default timeout of 3s quite often, but stays under 5s most of the time. If you want to make sure as many of your users as possible are getting the correct value, you can try setting the timeout to 5000 or something instead.
Jiakan Wang (Statsig)
Monday, July 18, 2022, 8:48 PM
I see. If you are using cookies, that’s closer to a “device id” kind of id than a “session id”, correct? In that case, I’d recommend keeping it alive for as long as you can, given that it’s basically your user’s unique identifier
Xiaoyu Yin
Monday, July 18, 2022, 8:08 PM
Hey statsig -- I noticed `checkGate` always returns a boolean, is there a way to determine if a gate exists at all?
Jacob Hurwitz
Monday, July 18, 2022, 8:48 PM
Maybe the `/gates` endpoint of the Console API? https://docs.statsig.com/console-api/gates
Jiakan Wang (Statsig)
Monday, July 18, 2022, 11:27 PM
Hi Xiaoyu Yin - if the gate does not exist (a typo, for example), we think the best thing to do is to return `false`, so that your app continues to function correctly. It’s recommended that during development, you check the “log stream” under the diagnostics section for the gate you are using. If it’s set up correctly, you should see logs coming in near real time to verify that everything is correct.
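Since a missing gate and a failing gate both come back as `false`, one development-time aid is a thin wrapper that flags names it does not recognize. The sketch below is an assumption-laden illustration: `makeStrictCheckGate` and `knownGates` are hypothetical, the real `checkGate` is injected rather than imported, and how the list of known gates is obtained (by hand, or from the Console API) is left open.

```typescript
// A gate check is any function from gate name to boolean, e.g. the SDK's
// `checkGate`. It is injected so this sketch does not depend on any SDK.
type GateCheck = (gateName: string) => boolean;

// Wrap a gate check so that unknown gate names are reported before the
// underlying check runs. `knownGates` is a hypothetical list of gate names
// maintained by the application.
function makeStrictCheckGate(
  checkGate: GateCheck,
  knownGates: string[],
  onUnknown: (gateName: string) => void,
): GateCheck {
  const known = new Set(knownGates);
  return (gateName) => {
    if (!known.has(gateName)) onUnknown(gateName);
    // Still defer to the real check, which returns false for missing gates.
    return checkGate(gateName);
  };
}

// Example wiring with a stand-in gate check.
const strictCheckGate = makeStrictCheckGate(
  (name) => name === "home_educational_content",
  ["home_educational_content"],
  (name) => console.warn(`Unknown gate "${name}" (typo?)`),
);
```

The wrapper does not change what the application sees; it only surfaces likely typos while developing.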
Joshua Sayavong
Monday, July 18, 2022, 7:39 PM
Hi all, is there a way to set up a callback for a subject exposure in Statsig? Particularly, I'm looking to set up a holdout group via Statsig and send the assignment to Braze so that we know not to send comms to the holdout group users. cc Omar Guenena (Lime)
Vineeth
Monday, July 18, 2022, 7:47 PM
You can use the Segment integration to flow back to Lime the actual assignment events - https://docs.statsig.com/integrations/data-connectors/segment#configuring-outbound-events
Joshua Sayavong
Monday, July 18, 2022, 8:01 PM
Unfortunately we don't use Segment in our stack. Could we leverage Outbound Webhooks (https://docs.statsig.com/integrations/event_webhook#outgoing) to send an exposure directly to the Braze API (https://www.braze.com/docs/api/endpoints/user_data/post_user_identify/)? I do see some problems though:
• Cannot specify headers for authing into Braze project
• Cannot customize payload to fit into the api
• Cannot filter by experiment (to control on quota)
Vineeth
Tuesday, July 19, 2022, 6:36 PM
Braze doesn't have a native integration with us, but we have several ways to export variant assignment (webhook, CDP integrations like Segment, Rudderstack, mParticle etc., and an API to export). Is there a CDP you currently use?
Jacob Hurwitz
Monday, July 18, 2022, 6:48 PM
So of 103 users to check that gate on 7/15, we had 3/103 that were not successfully initialized? Again, that seems oddly high. Do you have p99 and p99.9 numbers for initialize latency?
Sam Hibberd
Monday, July 18, 2022, 6:41 PM
I mean trying to replicate what the client-side implementation does as much as we can.
Sam Hibberd
Monday, July 18, 2022, 6:40 PM
We are just trying to get the best setup to also cover guest users. Previous conversations suggest that if we are using a customId then we need to continue to use that even after a user is logged in, to ensure continuity. Just trying to work out what we should be using on the PHP side: setting session cookies, actually starting a PHP session (or maybe some technique I don't know about).
Jiakan Wang (Statsig)
Monday, July 18, 2022, 6:36 PM
Usually I’d say as long as the session lasts, but not sure if there are some nuances here that cause them to be different than traditional sessions
Jiakan Wang (Statsig)
Monday, July 18, 2022, 6:35 PM
Hi! Does the session id represent an actual user session? If so, how long does a session usually last?
Sam Hibberd
Monday, July 18, 2022, 6:33 PM
Hi Jiakan Wang (Statsig) tore (statsig) when we are setting sessionId server side with php are there any rules / best practices we should adopt to ensure the most optimal setup, expiry dates etc etc?
Jiakan Wang (Statsig)
Monday, July 18, 2022, 6:04 PM
Checked the 6 users from 7/15 (can only find 6 even though it shows 7 in the graph, not sure why), among them: • 1 had the event cached from July 13th and not sent to us until 15th • 1 was using local override and forced to false • 1 was on old sdk (id 21915297), could be blocked by ad blocker? The other 3 were not successfully initialized and using cache.
Jiakan Wang (Statsig)
Monday, July 18, 2022, 5:56 PM
A higher `initTimeoutMs` definitely should help, if your goal is to make sure they get the most updated values. The default is 3000. When this happens, it’s most common for it to be on user’s side due to internet slowness. Our initialize’s latency has been really good, I just checked the numbers now and it’s been 3-5ms p50 and ~50ms p95 for the past week.
Jiakan Wang (Statsig)
Monday, July 18, 2022, 5:42 PM
Sounds like they didn’t check the response from initialize request, and you might be able to get them to try again this week?
Jacob Hurwitz
Monday, July 18, 2022, 5:42 PM
Thanks for the info. For some more context though, we don’t have many daily active users right now, so even this level is a _significant_ and unacceptably high percentage for us. I really want to drill down and find the root cause here. Some thoughts: • Would a higher `initTimeoutMs` help? • When the initialize call times out, is it generally client-side (Statsig resolves quickly, but the user’s internet connection is slow) or server-side (the Statsig server hangs when responding to the API call)? https://statsigcommunity.slack.com/archives/C01QVL20EDD/p1658116026585149?thread_ts=1657921888.499019&cid=C01QVL20EDD
Jacob Hurwitz
Monday, July 18, 2022, 5:40 PM
The switch from cache to uninitialized may have been caused by our team suggesting to that user that they clear their cache on Fri afternoon, as part of debugging the problem.
Jiakan Wang (Statsig)
Monday, July 18, 2022, 5:20 PM
Morning - just want to provide an update, as I ran some queries this morning to confirm what has happened. I think it is indeed just a small % of users getting a cached value (or sometimes uninitialized, therefore the default value). This is consistent with what we’ve seen elsewhere too. Here is what I checked:
1. I confirmed with my colleague that there shouldn’t have been any difference between hitting `featuregates.org` vs. the old domain;
2. I pulled details (see attached) of the 5 checks for the user mentioned above, and it looks like 4 of them had the reason `Cache` and 1 of them was `Uninitialized`, meaning the SDK wasn’t able to initialize successfully in that session, and there was no cache. The `Uninitialized` session happens to have a different stableID than the other 4 sessions, which makes sense. It seems like the user was having trouble connecting to Statsig that day for some reason;
3. I pulled data for the `home_educational_content` gate (attached #2) where the user was getting `false` after the gate was rolled out. The reason was mostly `Uninitialized` in the more recent days, meaning a user here and there was getting the default value (false) due to network reasons.
Jiakan Wang (Statsig)
Monday, July 18, 2022, 6:33 AM
If the initialize response returns within a reasonable amount of time, I really doubt they'd see the wrong experience, unless it's a regression in sdk v1.13.0
Jiakan Wang (Statsig)
Monday, July 18, 2022, 6:32 AM
I’d say inspect the initialize response and how long it took, and also the log event/rgstr request body to find the exposure, which will tell us exactly what value the user gets from the SDK and why.
