Skip to content

Crawler

The Crawler is built with Node.js and development is primarily supported for Visual Studio Code.

After Cloning or Pulling

Install node modules:

cd crawler
npm install

Setting up Editor

Install all recommended extensions when prompted to by Visual Studio Code after opening the workspace. These are configured in .vscode/extensions.json and include:

  • ESLint: Provides live feedback about style rules and potential code issues while editing JavaScript sources
  • markdownlint: Provides live feedback about style rules and potential code issues while editing Markdown sources
  • Node Debug: Provides for live debugging of Node.js scripts

Debugging

A test-organizations.json file at crawler/lib/repositories/organizations/__fixtures__/test-organizations.json is available with a minimal set of organizations for testing.

Debugging Within Visual Studio Code

The Crawl * launch configurations can be used to run partial or full crawls with interactive debugging available. Any breakpoints set within run.js or any classes will pause.

These are configured to use the abbreviated test-organizations.json file described again, comment out the argument to run a full crawl.

Debugging on the Command Line

You can run the crawler from the command line with interactive debugging enabled for attachment over TCP:

# export GITHUB_ACTOR and GITHUB_TOKEN if you git rate limits

node --inspect-brk \
    crawler/run.js \
    --all \
    --commit-to='snapshot/v1' \
    --commit-orgs-to='cfapi/orgs/v1' \
    --orgs-source='crawler/lib/repositories/organizations/__fixtures__/test-organizations.json'

Running Tests

Testing Within Visual Studio Code

With the Crawler: Jest Current File debug configuration selected, you can press F5 or otherwise run the Start Debugging command to execute the *.test.js file you have open. In most cases, you can do this with a source file open too and Jest will find the associated tests. Breakpoints in both test files and sources should work when running tests this way.

Use the Crawler: Jest All debug configuration to run all available tests.

Testing on the Command Line

cd crawler
npm run test

Architecture

crawler/run.js

The main entrypoint for running The Crawler’s command. Yargs is used to parse arguments and implement the command. Run node run.js --help to see all available options.

crawler/lib/repositories/**

Repository classes provide for interaction with specific populations of records.

crawler/lib/parsers/**

Parser classes help read records from raw data.

crawler/package.json

Tracks runtime and development node module dependencies for The Crawler

crawler/**/__tests__/*.test.js

Jest tests, (mostly) aligned with the source files they cover.

crawler/jest.config.js

Global configuration for Jest.

crawler/jest.setup.js

Script configured to run before every test suite.

crawler/.eslint.json

ESLint configuration that should work with Visual Studio Code’s ESLint extension