Conversation

Contributor

@kashifkhan0771 kashifkhan0771 commented Aug 29, 2025

Description:

This pull request replaces the HTTP API requests for the GitHub source's PRs, issues, and their comments with GraphQL requests.

Scan Results with no verification:

Note: While results may vary slightly between runs, GraphQL consistently outperforms REST in scan duration.

[Screenshot (2025-09-03): scan duration comparison between REST and GraphQL]

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; this requires golangci-lint)?

@kashifkhan0771 kashifkhan0771 marked this pull request as ready for review September 1, 2025 13:36
@kashifkhan0771 kashifkhan0771 requested review from a team as code owners September 1, 2025 13:36
@kashifkhan0771 kashifkhan0771 marked this pull request as draft September 1, 2025 13:36
@kashifkhan0771 kashifkhan0771 self-assigned this Sep 1, 2025
@kashifkhan0771 kashifkhan0771 marked this pull request as ready for review September 3, 2025 11:03
@kashifkhan0771 kashifkhan0771 requested a review from a team as a code owner September 3, 2025 11:03
Contributor

@rosecodym rosecodym left a comment

I was initially skeptical of putting the new code in graphql.go, but I think I see why you did it and I can't think of a better option.

However, I did have some trouble following the rate limiting code - in particular, how it interacts with the rest of the GitHub rate limiting code. Specifically, it's doing a lot of its own stuff before it begins to use the rest of the rate limiting code, and I don't understand why it needs to do that beforehand. (Also, mentioned inline, 5 minutes is a huge interval, and Prometheus gets skipped in many cases.)

Can you try another pass over the rate limiting code to see if things can be clarified a bit?

ctx.Logger().Info("GraphQL RATE_LIMITED error (fallback)",
	"retry_after", retryAfter.String())
time.Sleep(retryAfter)
return true
Contributor

This early return (and all of the later ones) happens before we report rate limiting to Prometheus, which means that we won't have much visibility when it happens. Relatedly, I find this new rate limiting code a bit hard to follow - I suspect the difficulty was with integrating it with the existing rate limiting code. (For example: why does this initial request have a 3x 5-minute retry step before getting into the global rate limiting code?) You might be able to resolve both issues at once with another refactoring pass at this function.

Contributor Author

I overlooked the reporting to Prometheus. Thanks for pointing out!

The reason I initially set the retry interval to 5 minutes is that the GraphQL API sometimes returns a rate limit error before the remaining value reaches 0. When this happens, the response includes no rate limit info, so we can't rely on the usual rate limit object, and the resetAt value defaults to zero, which can cause incorrect handling.

To deal with this, I first handled the error case directly, then moved on to processing the rate limit if available.

Your comment gave me a new perspective, so I adjusted the approach a bit. The new flow (sketched after this list) is:

  1. We first check if the global rate limit is already set.

    • If it's set, we use that value and delay accordingly.
  2. If it's not set, and we receive an error:

    • We check if it's a rate limit issue.
    • In that case, we set the global rate limit (so other requests know to wait) and continue reporting to Prometheus.
  3. If there's no error, we check the actual rate limit object:

    • If remaining < 3, we wait until the resetAt time.
    • Why 3? Because some GraphQL queries can cost more than one point, depending on complexity.
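
A minimal sketch of that flow, under stated assumptions: helper names like setGlobalRateLimit, globalRateLimitResetAt, the fallbackWait constant, and the Prometheus counter name are illustrative, not the exact code in this PR.

const fallbackWait = time.Minute // placeholder; the actual interval is still under discussion

func (s *Source) handleGraphQLRateLimit(ctx context.Context, rl *rateLimit, errIn error) bool {
	// 1. a global rate limit is already set: wait it out
	if resetAt := s.globalRateLimitResetAt(); time.Now().Before(resetAt) {
		time.Sleep(time.Until(resetAt))
		return true
	}
	// 2. an error with no usable rate limit info: if it's a rate limit
	//    error, set the global limit so other requests wait too, and
	//    report to Prometheus before returning
	if errIn != nil {
		if !strings.Contains(errIn.Error(), "rate limit") {
			return false // not a rate limit problem; the caller handles it
		}
		s.setGlobalRateLimit(time.Now().Add(fallbackWait))
		githubNumRateLimitEncountered.Inc() // hypothetical metric name
		return true
	}
	// 3. no error: leave headroom of 3 points, since some GraphQL
	//    queries cost more than one point
	if rl != nil && rl.Remaining < 3 {
		time.Sleep(time.Until(rl.ResetAt))
		return true
	}
	return false
}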

func (s *Source) handleGraphQLRateLimit(ctx context.Context, rl *rateLimit, errIn error, reporters ...errorReporter) bool {
	// if a rate limit exceeded error happened, wait for 5 minutes before trying again
	if errIn != nil && strings.Contains(errIn.Error(), "rate limit exceeded") {
		retryAfter := 5 * time.Minute
Contributor

This is a huge retry interval. Does GitHub prescribe it?

Contributor Author

Answered in the above comment.

Contributor

@camgunz camgunz left a comment

Broadly looks good, just had a couple notes. We should add tests for this--probably the integration tests should cover this (maybe a unit test or two for chunkIDs), so if we just run the integration tests w/ UseGithubGraphqlAPI both false and true, that'd give us even more confidence we're not breaking anything.
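
Something like this hypothetical wrapper could exercise both paths (the test name and assertion body are placeholders, not code from this PR):

func TestScanCommentsBothAPIs(t *testing.T) {
	for _, useGraphQL := range []bool{false, true} {
		t.Run(fmt.Sprintf("UseGithubGraphqlAPI=%v", useGraphQL), func(t *testing.T) {
			prev := feature.UseGithubGraphqlAPI.Load()
			feature.UseGithubGraphqlAPI.Store(useGraphQL)
			t.Cleanup(func() { feature.UseGithubGraphqlAPI.Store(prev) })
			// run the existing integration scan assertions here
		})
	}
}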

main.go Outdated
feature.GitlabProjectsPerPage.Store(100)

// OSS Default using github graphql api for issues, pr's and comments
feature.UseGithubGraphqlAPI.Store(true)
Contributor

Let's start this off as false so it's not automatically turned on for EE customers (we can/should make it true for OSS after we get the flag in EE)

Contributor

Also, regretfully I think the initialism is GraphQL--if we're gonna do it for REST we should probably do it here too

Contributor Author

Sure, I'll temporarily turn it off. I didn't quite understand your second comment - just to clarify, we don’t use any initialism for REST.

Contributor

Oh I was just saying the "QL" is capitalized; Graphql -> GraphQL

Contributor Author

Done

)

func (s *Source) processRepoComments(ctx context.Context, repoInfo repoInfo, reporter sources.ChunkReporter, cutoffTime *time.Time) error {
func (s *Source) processIssueandPRsWithCommentsREST(ctx context.Context, repoInfo repoInfo, reporter sources.ChunkReporter, cutoffTime *time.Time) error {
Contributor

I'm an 80 columns person because I have a bunch of buffers open side-by-side, but I accept I'm a relic from an earlier time. This one (and at least one more below) are > 150 though--can we wrap somewhere between 90 and 120?
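
For example, the declaration could wrap like this (purely illustrative formatting):

func (s *Source) processIssueandPRsWithCommentsREST(
	ctx context.Context,
	repoInfo repoInfo,
	reporter sources.ChunkReporter,
	cutoffTime *time.Time,
) error {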

return s.processRepoComments(ctx, repoInfo, reporter, cutoffTime)
// if we need to use graphql api for repo issues, prs and comments
if feature.UseGithubGraphqlAPI.Load() {
return s.processRepoIssueandPRsWithCommentsGraphql(ctx, repoInfo, reporter, cutoffTime)
Contributor

(No need to do anything here, just musing)

Hrm, w/ the old code passing cutoffTime just the once was kind of 🤷🏻 , but now that we're carrying it around everywhere, it makes me think it'd be nice if we processed s.commentsTimeframeDays up top in Init, that way we don't need to drill it down everywhere.
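
A rough sketch of that idea; the s.cutoffTime field is an assumed addition, not code from this PR:

// in Init, after the source config is parsed:
if s.commentsTimeframeDays > 0 {
	cutoff := time.Now().AddDate(0, 0, -int(s.commentsTimeframeDays))
	s.cutoffTime = &cutoff // read wherever needed instead of threading a parameter through every call
}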

Contributor Author

Sounds good, we can do it in a separate optimization PR.

return err
}

// if a pull request has more than 100 comments - some intern's pull request might end up here :)
Contributor

😂

@kashifkhan0771
Contributor Author

We should add tests for this--probably the integration tests should cover this (maybe a unit test or two for chunkIDs), so if we just run the integration tests w/ UseGithubGraphqlAPI both false and true, that'd give us even more confidence we're not breaking anything.

I added the test but even the existing integration tests are not working for me 😿

Contributor

@martinlocklear martinlocklear left a comment

Left a couple of comments about rate limiting, and a strong opinion about tabular tests. (I've been drawing folks' attention to testify's testsuite functionality, which serves similar purposes while being more maintainable.)

In general I think PRs this size are too large to effectively review - nothing stands out to me as a problem though.

// request more comments for this issue
var commentsQuery singleIssueComments
err := s.connector.GraphQLClient().Query(ctx, &commentsQuery, commentVars)
if s.handleGraphqlRateLimitWithChunkReporter(ctx, reporter, &query.RateLimit, err) {
Contributor

Suggested change
if s.handleGraphqlRateLimitWithChunkReporter(ctx, reporter, &query.RateLimit, err) {
if s.handleGraphqlRateLimitWithChunkReporter(ctx, reporter, &commentsQuery.RateLimit, err) {

Not sure if this is what you want or not.

Contributor Author

Good catch 👀

}

err := s.connector.GraphQLClient().Query(ctx, &commentQuery, singlePRVars)
if s.handleGraphqlRateLimitWithChunkReporter(ctx, reporter, &query.RateLimit, err) {
Contributor

Suggested change
if s.handleGraphqlRateLimitWithChunkReporter(ctx, reporter, &query.RateLimit, err) {
if s.handleGraphqlRateLimitWithChunkReporter(ctx, reporter, &commentQuery.RateLimit, err) {

Similar to another comment I made. It's not obvious to me from looking here which rate limit we should be respecting. Some inline comments might help.

If we aren't interested in respecting separate rate limits for each query object separately, I don't understand why the rate limit object is attached to the query object. (maybe out of scope for this?).

Contributor Author

We want to make sure we're respecting the rate limits for single-query objects, like the comments on a single PR. I mistakenly used the wrong rate limit in this case; thanks for catching that. While it's rare to hit the rate limit on these individual queries, it's still important to get it right.
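
For context, this is roughly how a per-query rate limit rides along in a shurcooL/githubv4-style query struct; the field shapes here are illustrative, not the exact types in this PR:

var commentsQuery struct {
	Repository struct {
		Issue struct {
			Comments struct {
				Nodes []struct {
					Body githubv4.String
				}
				PageInfo struct {
					EndCursor   githubv4.String
					HasNextPage githubv4.Boolean
				}
			} `graphql:"comments(first: 100, after: $commentsPagination)"`
		} `graphql:"issue(number: $number)"`
	} `graphql:"repository(owner: $owner, name: $name)"`
	RateLimit struct {
		Cost      githubv4.Int
		Remaining githubv4.Int
		ResetAt   githubv4.DateTime
	} `graphql:"rateLimit"`
}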

commentsPagination: (*githubv4.String)(nil),
}

if err := s.connector.GraphQLClient().Query(ctx, &query, reviewThreadVars); err != nil {
Contributor

Does this need to be rate limited as well?
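
If so, following the pattern used elsewhere in this PR, it might look roughly like this (the retry shape is an assumption):

err := s.connector.GraphQLClient().Query(ctx, &query, reviewThreadVars)
if s.handleGraphqlRateLimitWithChunkReporter(ctx, reporter, &query.RateLimit, err) {
	continue // retry after the rate limit wait
}
if err != nil {
	return err
}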

}

var retryAfter time.Duration
// if rate limit exceeded error happened, wait for 5 minutes before trying again
Contributor

I think this comment is out of date now.

numExpectedChunks int
}{
{
name: "unauthenticated, single repo, issue comments and pr comments",
Contributor

Please avoid using tabular tests like this. They end up being hard to understand and even harder to maintain. Breaking these out into individual test functions (or even a new test suite) will make our lives easier in the future. imo, it's just as easy to copypasta a test function as it is to copypasta an item in a list of tests, but it's harder to manage as code paths diverge when they're not broken out.
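
A minimal testify suite sketch of what that could look like (names are illustrative, not from this PR):

type GithubIntegrationSuite struct {
	suite.Suite
}

func (s *GithubIntegrationSuite) TestUnauthenticatedSingleRepoComments() {
	// formerly one row in the table; divergent setup and assertions
	// now live in their own method
}

func TestGithubIntegrationSuite(t *testing.T) {
	suite.Run(t, new(GithubIntegrationSuite))
}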

Contributor Author

The github_integration_test.go file uses tabular test cases throughout. If we want to update the test case structure, we should apply the change consistently across all test cases. It might be better to create a separate ticket or PR for that cleanup.

Contributor Author

Done
