The limit is real
GitHub's REST and GraphQL Search APIs return a maximum of 1,000 results per query. This is documented, confirmed by developers, and has no official bypass. For enterprise teams building AI tooling discovery systems, this constraint matters.
AI agents like Claude, OpenAI Codex, and GitHub Copilot use SKILL.md files to define capabilities: PDF handling, Excel formulas, brand guidelines. These files are scattered across GitHub in ~/.claude/skills/, .github/skills/, ad-hoc skills/ folders, and personal dotfiles repos.
A single search for filename:SKILL.md hits the 1,000-result ceiling immediately.
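The ceiling is not a soft truncation: GitHub rejects any request for a page past result 1,000, no matter how large total_count is. A minimal sketch of the arithmetic (the cap and the 100-per-page maximum come from GitHub's search API docs):

```python
GITHUB_SEARCH_CAP = 1_000   # hard ceiling on retrievable results per query
PER_PAGE_MAX = 100          # largest page size the search API accepts

def reachable_pages(per_page: int = PER_PAGE_MAX) -> int:
    """Number of result pages you can fetch before the cap cuts you off."""
    return GITHUB_SEARCH_CAP // per_page

def lost_results(total_count: int) -> int:
    """Matches reported by total_count that no amount of paging will return."""
    return max(0, total_count - GITHUB_SEARCH_CAP)
```

Requesting page 11 at per_page=100 returns HTTP 422 ("Only the first 1000 search results are available"), so a filename:SKILL.md query reporting tens of thousands of matches silently loses everything past the first thousand.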
The workaround: query segmentation
The team behind SkillHub, an open-source skill marketplace, sidestepped the limit instead of fighting it: many narrow searches, each scoped tightly enough that its result set fits under 1,000.
Path-based chunking: Separate queries for path:skills, path:.claude, path:.github, path:.codex. Four queries, up to 4,000 potential results.
File size segmentation: size:<1000, size:1000..5000, size:>5000 partition the matching files into disjoint buckets, each with its own 1,000-result budget.
Topic filtering: Repos tagged claude-skills, agent-skills, ai-skills get deep-scanned individually.
Curated list crawling: Parse awesome-lists for linked repositories, then index those.
Fork traversal: Check popular repo forks for modified skills that never merged upstream.
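Mechanically, the first two strategies amount to crossing qualifiers to partition one oversized query into many small ones, then deduplicating the union. A sketch under that reading (the qualifier syntax is GitHub's; the paths and size buckets mirror the list above):

```python
from itertools import product

BASE = "filename:SKILL.md"
PATHS = ["path:skills", "path:.claude", "path:.github", "path:.codex"]
SIZES = ["size:<1000", "size:1000..5000", "size:>5000"]

def segmented_queries() -> list:
    """Cross path and size qualifiers: 12 queries, each with its own 1,000-result budget."""
    return [f"{BASE} {p} {s}" for p, s in product(PATHS, SIZES)]

def dedupe(results: list) -> dict:
    """Segments can overlap, so key each hit by (repo, path) and keep one copy."""
    return {(r["repository"]["full_name"], r["path"]): r for r in results}
```

The result keys match the shape of GitHub's code-search response items, where each hit carries a repository object and a path.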
The system runs daily incremental crawls and weekly full discovery passes. They handle GitHub's 1,000 requests per hour per token limit by rotating credentials.
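Credential rotation can be as simple as round-robining a token pool and parking any token whose quota is spent until its reset time, both of which GitHub reports in the X-RateLimit-Remaining and X-RateLimit-Reset response headers. A minimal sketch, with hypothetical token names (the project's actual rotation logic is not published):

```python
import time
from itertools import cycle

class TokenPool:
    """Round-robin over tokens, skipping any that are rate-limited until reset."""

    def __init__(self, tokens):
        self._tokens = tokens
        self._cycle = cycle(tokens)
        self._cooldown = {}   # token -> epoch seconds when it becomes usable again

    def next(self, now=None):
        now = time.time() if now is None else now
        for _ in range(len(self._tokens)):
            tok = next(self._cycle)
            if self._cooldown.get(tok, 0) <= now:
                return tok
        raise RuntimeError("all tokens rate-limited; wait for the earliest reset")

    def mark_exhausted(self, token, reset_epoch):
        """Call when X-RateLimit-Remaining hits 0; reset_epoch is X-RateLimit-Reset."""
        self._cooldown[token] = reset_epoch
```

A crawler would call next() before each request and mark_exhausted() whenever a 403/429 response or a zeroed remaining-quota header comes back.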
What they built
The stack: Next.js 15 frontend, PostgreSQL for metadata, Meilisearch for typo-tolerant search, Redis/BullMQ for background jobs. Every skill gets scanned for shell commands, prompt injection patterns, and data exfiltration attempts.
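A scanning pass like that can start as plain pattern matching before anything heavier. SkillHub's actual rule set is not published, so the patterns below are illustrative assumptions, one per threat class named above:

```python
import re

# Illustrative rules only; a production scanner needs a far larger, tested set.
RULES = {
    "shell_command": re.compile(r"\brm\s+-rf\b|curl\s+[^\n|]*\|\s*(?:ba)?sh"),
    "prompt_injection": re.compile(r"ignore\s+(?:all\s+)?previous\s+instructions", re.I),
    "data_exfiltration": re.compile(r"\b(?:AWS_SECRET|api[_-]?key)\b.*https?://", re.I),
}

def scan_skill(text: str) -> list:
    """Return the name of every rule this skill file trips."""
    return [name for name, rx in RULES.items() if rx.search(text)]
```

Regexes give cheap first-pass triage; anything flagged would then justify a slower, more careful review step.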
Current index: 172,000+ skills, 4,000+ contributors, 30 categories.
CLI available via npm install -g skillhub. Web interface at skills.palebluedot.live. MIT licensed.
Worth noting
The 172,000 figure is unverified in public GitHub stats. The approach works but developers in forums call similar techniques "dirty hacks" prone to API variability and incomplete coverage. PyGithub's inefficient paging (29 API calls for 1,000 results) pushes some teams toward raw requests library implementations.
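The paging complaint largely comes down to page size: the search API accepts per_page=100, so a raw HTTP client can cover the full 1,000-result window in 10 calls. A sketch using requests (the token is a placeholder; the import is deferred so the call-count arithmetic stays dependency-free):

```python
import math

def calls_needed(results: int, per_page: int) -> int:
    """API calls required to page through `results` hits at a given page size."""
    return math.ceil(results / per_page)

def fetch_window(query: str, token: str) -> list:
    """Fetch every retrievable hit for one query at the maximum page size."""
    import requests  # deferred so the arithmetic above has no dependency

    url = "https://api.github.com/search/code"
    headers = {"Authorization": f"token {token}",
               "Accept": "application/vnd.github+json"}
    items = []
    for page in range(1, calls_needed(1000, 100) + 1):
        resp = requests.get(url, headers=headers,
                            params={"q": query, "per_page": 100, "page": page})
        resp.raise_for_status()
        batch = resp.json().get("items", [])
        items.extend(batch)
        if len(batch) < 100:   # short page means no more results
            break
    return items
```

PyGithub also accepts a per_page argument on its client constructor, which narrows the gap without leaving the library; the raw-requests route just makes the page size (and the retry/rotation logic around it) fully explicit.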
For CTOs evaluating AI agent infrastructure: GitHub's search constraints are real, workarounds exist but require multi-query orchestration, and no method guarantees complete coverage. If your team needs large-scale GitHub discovery, budget for query strategy, not just rate limit handling.