Jay Chi

in/lostjay · lostjay · lostjay.xyz

SUMMARY

Backend developer with production experience designing and maintaining large-scale crawling and data service systems. Built backend services and distributed crawling pipelines using Python, Java, FastAPI, Spring Boot, Kafka, MongoDB, MySQL, and Redis, with coroutine-based concurrency, proxy management, retry/backoff strategies, task state management, and structured data parsing. Supported crawler platforms including the Purple resume crawling pipeline and the Quake headhunter-platform crawlers. Provided real-time crawling APIs for certificate verification, web search, and on-demand data retrieval, supporting low-latency access to business data.

WORK EXPERIENCE

Crawler Engineer Intern -> Crawler Engineer
Python, Java, JS/Android Reverse Engineering, Anti-bot Analysis, AI Agents
Apr 2024 – Present
  • Built and maintained large-scale web crawling systems for multimodal AI training data and business data needs, covering recruitment, resume, headhunter, and competitor intelligence scenarios
  • Provided real-time crawling APIs for internal business workflows, including certificate verification, web search, and on-demand data retrieval
  • Developed backend services and crawler task workflows using Python, Java, FastAPI, Spring Boot, Kafka, MongoDB, and Redis
  • Designed distributed crawling pipelines with coroutine-based concurrency, proxy management, task scheduling, retry handling, state management, and structured data parsing
  • Performed anti-bot and risk-control analysis, covering device/browser fingerprinting, TLS/HTTP/2 fingerprinting, network traffic capture, captcha solving, and customized patched/stealth Playwright runtimes
  • Reverse-engineered JavaScript, Android, and WeChat Mini Program workflows to analyze request signatures, encryption logic, authentication flows, and anti-crawling mechanisms
  • Applied cryptographic analysis including AES, RSA, and message digest algorithms to reproduce protected request parameters and verify data integrity
  • Designed the architecture of an AI-agent-assisted crawling platform, integrating Model Context Protocol, context management, SSE streaming, and tool orchestration to support crawler configuration and debugging

PROJECTS

Large-scale Crawling · AI Training Data
Python, Java, Distributed Crawling, Proxy Management, Anti-bot Analysis
Apr 2024 – Present
  • Built and maintained large-scale crawling workflows for multimodal AI training data collection, covering text, image, video, document, and structured web data from platforms including YouTube, Zhihu, Baidu Wenku, and other high-risk web sources
  • Supported petabyte-scale annual data collection, contributing to a data supply system with a capacity of up to 5 PB/year for AI training and business data delivery
  • Developed distributed crawling pipelines with coroutine-based concurrency, proxy management, retry/backoff strategies, task state management, and structured data parsing to ensure stable high-throughput delivery
  • Analyzed platform anti-bot mechanisms, including request behavior limits, access-control triggers, fingerprinting signals, Android/Web API constraints, and JS-rendering barriers

Recruitment Data · Corporate Landscape · Marketing Intelligence
Python, Java, Playwright, Kafka, MongoDB, Redis, Reverse Engineering
Jan 2025 – Present
  • Built and maintained multiple business-intelligence crawling systems covering headhunter resumes, headhunter jobs, competitor job postings, corporate landscape data, ad promotion data, marketing balance, and transaction records
  • Supported the Purple platform's headhunter resume crawling, including account/session handling, resume fetching, structured parsing, task state management, and callback-based result delivery
  • Supported the Quake platform's headhunter job crawling, including platform login flows, job list/detail retrieval, position parsing, session management, and crawler stability improvements
  • Built competitor job crawling workflows to track job posting updates, online/offline status changes, and market signals from competitor platforms; supported downstream CRM analysis by cleaning and classifying the data into two business categories
  • Maintained corporate landscape crawling workflows for company profiles, licenses, qualification records, and related corporate metadata, with primary ownership of Hong Kong company data sources
  • Implemented authenticated crawling workflows for ad promotion, account-level marketing metrics, balance information, promotion records, and transaction details to support financial and marketing data reconciliation
  • Improved crawler robustness through proxy management, browser automation, captcha handling, cookie/session management, retry/backoff strategies, structured error handling, and reverse engineering of request signatures and anti-crawling mechanisms

AI Agent-assisted Crawling Platform
Python, Java, MCP, OpenAI Agents SDK, SSE Streaming
Jul 2025 – Present
  • Designed the architecture of an AI-agent-assisted crawling platform for crawler configuration, field parsing, seed management, and debugging workflows
  • Integrated Model Context Protocol, context management, SSE streaming, and tool orchestration to connect LLM reasoning with crawler tools and browser automation
  • Built agent workflows for request/response analysis, parsing-field recommendation, seed browsing, and crawler configuration assistance
  • Improved multi-turn agent reliability by handling context consistency, tool-call state, MCP service lifecycle, and streaming response behavior

EDUCATION

Bachelor of Management in Information Management and Information Systems
Sep 2020 – Jul 2024

SKILLS

Backend & Distributed Systems

Python · Java · FastAPI · Spring Boot · Asyncio · Concurrency · Socket Programming · Distributed Systems · Kafka · MongoDB · MySQL · Redis

Web Crawling & Reverse Engineering

Web Scraping · Anti-bot & Risk Control Analysis · Device & Browser Fingerprinting · Captcha Solving · JavaScript / Android / WeChat Mini Program Reverse Engineering · Cryptography: AES, RSA, Message Digest Algorithms · TLS / HTTP/2 Fingerprinting · Browser Automation: Playwright

AI Agent Engineering

Model Context Protocol · Context Management · SSE Streaming · Agent Orchestration

DevOps & Infrastructure

Docker Compose · Linux · Git · Nginx · Proxy Networking

Languages

Mandarin Chinese (Native) · English (Professional Working Proficiency)