Scheduling a crawler run
This is an end-to-end scenario describing the different operations to schedule a crawler run from a Windows environment.
About this task
Crawling a connection allows you to retrieve data at a large scale and enrich your inventory more efficiently. Use the API to automate the runs and easily maintain an up-to-date inventory of datasets with high-quality, current data. First, you need the ID of the crawler you want to run; to get it, list all the existing crawlers. Then create a batch file on your computer that runs the crawler on the schedule you define.
method: GET
endpoint: https://api.<env>.cloud.talend.com/connections/crawlers
headers: {
"Accept": "application/json",
"Authorization": "Bearer <your_personal_access_token>"
}
Procedure
-
Select GET from the Method list and, in the field next to it, enter the endpoint to be used:
https://api.<env>.cloud.talend.com/connections/crawlers
-
Click Add header to add a row and enter the following key:value pairs:
Accept: application/json
Authorization: Bearer <your_personal_access_token>
-
Send the request.
The BODY area is updated and the status code 200 is returned. The response should look like this:
{ "data": [ { "id": "c86f9c5c-b23c-467a-91ca-443498093c24", "connectionId": "43cd2ad4-0f8a-4965-b66b-8388e044fda9", "name": "\"aurora Mysql\" - Crawler", "description": "demo", "sharings": [ { "scimType": "user", "scimId": "48037086-8b33-4815-9f67-74345b1d25f5", "level": "OWNER" }, { "scimType": "user", "scimId": "b33902b8-22fc-4a09-a49a-0592b91e5ad3", "level": "READER" } ], "status": { "runStatus": "Finished", "runStartedAt": "2023-05-10T07:17:05.943039Z", "runBy": "48045786-8b33-4815-9f67-74345b1d25f5", "runFinishedAt": "2023-05-10T07:17:20.300821Z" }, "createdAt": "2023-05-04T12:32:44.715629Z", "createdBy": "48045786-8b33-4815-9f67-74345b1d25f5", "crawledDatasets": [ "DP_output2", "test", "TEST163", "test_special_columns" ], "massSamplingId": "c69de04a-1111-43c0-a84c-fdf5a3dea9e0", "updateAt": "2023-05-09T15:20:50.069483Z", "updatedBy": "48045786-8b33-4815-9f67-74345b1d25f5" } ] }
-
On your computer, create an empty batch file (.bat) and name it curl_crawler_run.bat, for example.
-
Add the following instructions in the newly created file:
@echo off
SET crawler_id=<crawler_id>
SET bearer_token=<personal_access_token>
REM Log information in a file whose name contains the timestamp.
set timestamp=%DATE:/=-%_%TIME::=-%
set timestamp=%timestamp: =%
set timestamp=%timestamp:,=-%
set timestamp=%timestamp:.=-%
set log_file_name="C:\Users\<filepath>\crawler scheduling\crawler_run_%timestamp%.log"
echo calling the run crawler endpoint with this ID : %crawler_id% > %log_file_name%
REM Performing the call to the run endpoint with the POST method
curl https://api.<env>.cloud.talend.com/connections/crawlers/%crawler_id%/run -H "Authorization: Bearer %bearer_token%" -X POST
Replace the placeholders with the correct values:
crawler_id: ID of the crawler you want to run.
personal_access_token: Your personal access token.
filepath: The filepath to the curl_crawler_run.bat file.
env: Your Talend environment. It can be eu, us, us-west, ap, or au.
-
Save the file.
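The batch file above is Windows-specific. For comparison, the same two steps (write a timestamped log file, then POST to the run endpoint) can be sketched in POSIX shell; the crawler ID, token, environment, and paths below are placeholders to replace with your own values:

```shell
#!/bin/sh
# Placeholder values; replace with your own.
crawler_id="c86f9c5c-b23c-467a-91ca-443498093c24"
bearer_token="my_personal_access_token"
env="eu"

# Build a filename-safe timestamp, e.g. 2023-05-10_07-17-05.
timestamp=$(date +%Y-%m-%d_%H-%M-%S)
log_file="crawler_run_${timestamp}.log"

echo "calling the run crawler endpoint with this ID : ${crawler_id}" > "$log_file"

# POST to the run endpoint, logging the response; keep going even if the call fails.
curl -s -X POST \
  -H "Authorization: Bearer ${bearer_token}" \
  "https://api.${env}.cloud.talend.com/connections/crawlers/${crawler_id}/run" \
  >> "$log_file" 2>&1 || echo "run request failed, see $log_file"
```

Unlike the batch version, this variant does not need the %DATE%/%TIME% string substitutions, because date can format the timestamp directly.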
-
Open a Command Prompt on your computer.
-
Enter the following command:
schtasks /create /tn <task_name> /tr curl_crawler_run.bat /sc <frequency> /st <time> /ru "<domain-name>\<username>"
Replace the placeholders with the correct values:
task_name: Name of your task. For example: run_the_crawler.
frequency: Frequency of the crawler run. For example: DAILY.
time: Time of the day when the crawler is run, in the format HH:MM. For example: 10:00.
domain-name: Name of the domain on your computer. For example: TALEND.
username: Name of the user on your computer. For example: jdoe.
Results
The crawler now runs automatically at the frequency and time of day you specified. To delete the scheduled task, use this command in your Command Prompt:
schtasks /delete /tn run_the_crawler