Scheduling a crawler run

This is an end-to-end scenario describing the different operations to schedule a crawler run from a Windows environment.

About this task

Crawling a connection lets you retrieve data at scale and enrich your inventory more efficiently. Use the API to automate the runs and easily maintain an up-to-date inventory of datasets with high-quality, current data. First, you need the ID of the crawler you want to run, which you get by listing all the existing crawlers. Then you create a batch file on your computer that runs the crawler on a schedule.

method: GET
endpoint: https://api.<env>.cloud.talend.com/connections/crawlers
headers: {
 "Accept": "application/json",
 "Authorization": "Bearer <your_personal_access_token>"
}
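
The procedure below sends this request from an API testing tool. If you prefer to try it from a terminal first, here is an equivalent curl call, as a minimal sketch assuming curl is installed on your machine:

curl -s "https://api.<env>.cloud.talend.com/connections/crawlers" -H "Accept: application/json" -H "Authorization: Bearer <your_personal_access_token>"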

Procedure

  1. Select GET from the Method list and, in the field next to it, enter the endpoint to be used: https://api.<env>.cloud.talend.com/connections/crawlers.

  2. Click Add header to add a row and enter the following key:value pairs:

    • Accept: application/json
    • Authorization: Bearer <your_personal_access_token>
  3. Send the request.

    The BODY area is updated and the status code 200 is returned. The response should look like this:

    {
      "data": [
        {
          "id": "c86f9c5c-b23c-467a-91ca-443498093c24",
          "connectionId": "43cd2ad4-0f8a-4965-b66b-8388e044fda9",
          "name": "\"aurora Mysql\" - Crawler",
          "description": "demo",
          "sharings": [
            {
              "scimType": "user",
              "scimId": "48037086-8b33-4815-9f67-74345b1d25f5",
              "level": "OWNER"
            },
            {
              "scimType": "user",
              "scimId": "b33902b8-22fc-4a09-a49a-0592b91e5ad3",
              "level": "READER"
            }
          ],
          "status": {
            "runStatus": "Finished",
            "runStartedAt": "2023-05-10T07:17:05.943039Z",
            "runBy": "48045786-8b33-4815-9f67-74345b1d25f5",
            "runFinishedAt": "2023-05-10T07:17:20.300821Z"
          },
          "createdAt": "2023-05-04T12:32:44.715629Z",
          "createdBy": "48045786-8b33-4815-9f67-74345b1d25f5",
          "crawledDatasets": [
            "DP_output2",
            "test",
            "TEST163",
            "test_special_columns"
          ],
          "massSamplingId": "c69de04a-1111-43c0-a84c-fdf5a3dea9e0",
          "updateAt": "2023-05-09T15:20:50.069483Z",
          "updatedBy": "48045786-8b33-4815-9f67-74345b1d25f5"
        }
      ]
    }
    
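    The id field in the response is the crawler ID you need for the next steps. If you have jq installed on your machine, you can also extract the crawler IDs directly from the command line. A minimal sketch, assuming curl and jq are available:

    curl -s "https://api.<env>.cloud.talend.com/connections/crawlers" -H "Authorization: Bearer <your_personal_access_token>" | jq -r ".data[].id"
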
  4. On your computer, create an empty batch file (.bat) and name it, for example, curl_crawler_run.bat.

  5. Add the following instructions to the newly created file:

    @echo off
    SET crawler_id=<crawler_id>
    SET bearer_token=<personal_access_token>

    REM Build a timestamp for the log file name, replacing characters that are
    REM not allowed in file names (slashes, colons, spaces, commas, periods).
    set timestamp=%DATE:/=-%_%TIME::=-%
    set timestamp=%timestamp: =%
    set timestamp=%timestamp:,=-%
    set timestamp=%timestamp:.=-%
    set log_file_name="C:\Users\<filepath>\crawler scheduling\crawler_run_%timestamp%.log"
    echo calling the run crawler endpoint with this ID : %crawler_id% > %log_file_name%

    REM Call the run endpoint with the POST method and append the response to the log file.
    curl https://api.<env>.cloud.talend.com/connections/crawlers/%crawler_id%/run -H "Authorization: Bearer %bearer_token%" -X POST >> %log_file_name% 2>&1


    Replace the placeholders with the correct values:

    Parameter              Value
    crawler_id             ID of the crawler you want to run.
    personal_access_token  Your personal access token.
    filepath               The file path to the curl_crawler_run.bat file.
    env                    Your Talend environment: eu, us, us-west, ap, or au.
  6. Save the file.
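
    Before scheduling the file, you can run it once by hand to check that it works. A quick test from a Command Prompt, assuming the file is saved in the folder used in the script:

    cd "C:\Users\<filepath>\crawler scheduling"
    curl_crawler_run.bat

    Then open the crawler_run_<timestamp>.log file that was created and check that the response does not contain an error.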

  7. Open a Command Prompt on your computer.

  8. Enter the following command:

    schtasks /create /tn <task_name> /tr "<filepath>\curl_crawler_run.bat" /sc <frequency> /st <time> /ru "<domain-name>\<username>"
    

    Replace the placeholders with the correct values:

    Parameter    Value
    task_name    Name of your task. For example: run_the_crawler.
    filepath     Path to the folder that contains the curl_crawler_run.bat file.
    frequency    Frequency of the crawler run. For example: DAILY.
    time         Time of the day when the crawler runs, in HH:MM format. For example: 10:00.
    domain-name  Name of the domain on your computer. For example: TALEND.
    username     Name of the user on your computer. For example: jdoe.
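
    For example, with the sample values above, and assuming the batch file is stored in C:\Users\jdoe\crawler scheduling (a hypothetical path, adjust it to your setup), the command looks like this:

    schtasks /create /tn run_the_crawler /tr "C:\Users\jdoe\crawler scheduling\curl_crawler_run.bat" /sc DAILY /st 10:00 /ru "TALEND\jdoe"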

Results

The crawler now runs automatically at the frequency and time you defined, daily at the specified time in this example. To delete the scheduled task, run the following command in a Command Prompt:

    schtasks /delete /tn run_the_crawler
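
To verify that the task was created and to see its next run time, you can also run:

    schtasks /query /tn run_the_crawler /v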