Scheduling a crawler run

This is an end-to-end scenario describing the different operations to schedule a crawler run from a Windows environment.

About this task

Crawling a connection lets you retrieve data at scale and enrich your inventory more efficiently. Use the API to automate the runs and easily maintain an up-to-date inventory of datasets with high-quality, current data. First, you need the ID of the crawler you want to run, which you get by listing all the existing crawlers. Then you create a batch file on your computer that runs the crawler on a schedule.

method: GET
endpoint: https://api.<env>.cloud.talend.com/connections/crawlers
headers: {
 "Accept": "application/json",
 "Authorization": "Bearer <your_personal_access_token>"
}
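
The procedure below sends this request from an API testing tool. If you prefer to try it from a terminal first, here is an equivalent curl call, as a minimal sketch assuming curl is installed on your machine:

curl -s "https://api.<env>.cloud.talend.com/connections/crawlers" -H "Accept: application/json" -H "Authorization: Bearer <your_personal_access_token>"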

Procedure

  1. Select GET from the Method list and, in the field next to it, enter the endpoint to be used: https://api.<env>.cloud.talend.com/connections/crawlers.

  2. Click Add header to add a row and enter the following key:value pairs:

    • Accept: application/json
    • Authorization: Bearer <your_personal_access_token>
  3. Send the request.

    The BODY area is updated and the status code 200 is returned. The response should look like this:

    {
      "data": [
        {
          "id": "c86f9c5c-b23c-467a-91ca-443498093c24",
          "connectionId": "43cd2ad4-0f8a-4965-b66b-8388e044fda9",
          "name": "\"aurora Mysql\" - Crawler",
          "description": "demo",
          "sharings": [
            {
              "scimType": "user",
              "scimId": "48037086-8b33-4815-9f67-74345b1d25f5",
              "level": "OWNER"
            },
            {
              "scimType": "user",
              "scimId": "b33902b8-22fc-4a09-a49a-0592b91e5ad3",
              "level": "READER"
            }
          ],
          "status": {
            "runStatus": "Finished",
            "runStartedAt": "2023-05-10T07:17:05.943039Z",
            "runBy": "48045786-8b33-4815-9f67-74345b1d25f5",
            "runFinishedAt": "2023-05-10T07:17:20.300821Z"
          },
          "createdAt": "2023-05-04T12:32:44.715629Z",
          "createdBy": "48045786-8b33-4815-9f67-74345b1d25f5",
          "crawledDatasets": [
            "DP_output2",
            "test",
            "TEST163",
            "test_special_columns"
          ],
          "massSamplingId": "c69de04a-1111-43c0-a84c-fdf5a3dea9e0",
          "updateAt": "2023-05-09T15:20:50.069483Z",
          "updatedBy": "48045786-8b33-4815-9f67-74345b1d25f5"
        }
      ]
    }
    
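    The id field in the response is the crawler ID you need for the next steps. If you have jq installed on your machine, you can also extract the crawler IDs directly from the command line. A minimal sketch, assuming curl and jq are available:

    curl -s "https://api.<env>.cloud.talend.com/connections/crawlers" -H "Authorization: Bearer <your_personal_access_token>" | jq -r ".data[].id"
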
  4. On your computer, create an empty batch file (.bat) and name it, for example, curl_crawler_run.bat.

  5. Add the following instructions to the newly created file:

    @echo off
    SET crawler_id=<crawler_id>
    SET bearer_token=<personal_access_token>

    REM Build a timestamp for the log file name, replacing characters that are
    REM not allowed in file names (slashes, colons, spaces, commas, periods).
    set timestamp=%DATE:/=-%_%TIME::=-%
    set timestamp=%timestamp: =%
    set timestamp=%timestamp:,=-%
    set timestamp=%timestamp:.=-%
    set log_file_name="C:\Users\<filepath>\crawler scheduling\crawler_run_%timestamp%.log"
    echo calling the run crawler endpoint with this ID : %crawler_id% > %log_file_name%

    REM Call the run endpoint with the POST method and append the response to the log file.
    curl https://api.<env>.cloud.talend.com/connections/crawlers/%crawler_id%/run -H "Authorization: Bearer %bearer_token%" -X POST >> %log_file_name% 2>&1


    Replace the placeholders with the correct values:

    Parameter              Value
    crawler_id             ID of the crawler you want to run.
    personal_access_token  Your personal access token.
    filepath               The file path to the curl_crawler_run.bat file.
    env                    Your Talend environment: eu, us, us-west, ap, or au.
  6. Save the file.
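
    Before scheduling the file, you can run it once by hand to check that it works. A quick test from a Command Prompt, assuming the file is saved in the folder used in the script:

    cd "C:\Users\<filepath>\crawler scheduling"
    curl_crawler_run.bat

    Then open the crawler_run_<timestamp>.log file that was created and check that the response does not contain an error.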

  7. Open a Command Prompt on your computer.

  8. Enter the following command:

    schtasks /create /tn <task_name> /tr "<filepath>\curl_crawler_run.bat" /sc <frequency> /st <time> /ru "<domain-name>\<username>"
    

    Replace the placeholders with the correct values:

    Parameter    Value
    task_name    Name of your task. For example: run_the_crawler.
    filepath     Path to the folder that contains the curl_crawler_run.bat file.
    frequency    Frequency of the crawler run. For example: DAILY.
    time         Time of the day when the crawler runs, in HH:MM format. For example: 10:00.
    domain-name  Name of the domain on your computer. For example: TALEND.
    username     Name of the user on your computer. For example: jdoe.
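
    For example, with the sample values above, and assuming the batch file is stored in C:\Users\jdoe\crawler scheduling (a hypothetical path, adjust it to your setup), the command looks like this:

    schtasks /create /tn run_the_crawler /tr "C:\Users\jdoe\crawler scheduling\curl_crawler_run.bat" /sc DAILY /st 10:00 /ru "TALEND\jdoe"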

Results

The crawler now runs automatically at the frequency and time you defined, daily at the specified time in this example. To delete the scheduled task, run the following command in a Command Prompt:

    schtasks /delete /tn run_the_crawler
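
To verify that the task was created and to see its next run time, you can also run:

    schtasks /query /tn run_the_crawler /v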