AWS Glue: Continuation for job JobBookmark does not exist

This will be a quick post, but I could not find much on this error, so I figured I'd post it for others.

{"service":"AWSGlue","statusCode":400,"errorCode":"EntityNotFoundException","requestId":"xxxxx","errorMessage":"Continuation for job JobBookmark for accountId=xxxxx, jobName=myjob, runId=jr_xxxxx does not exist. not found","type":"AwsServiceError"}

I was recently working on a PySpark job in AWS Glue and attempting to use the Job Bookmarks feature, which lets your Spark jobs bookmark the last set of data read from S3 so that the next run doesn't process it again. You can read more about this in the AWS Glue documentation.
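Conceptually, a bookmark is just a high-water mark over the input: the job records how far it has read, and the next run filters out anything at or below that mark. A minimal sketch of the idea (this is an illustration, not Glue's actual implementation):

```python
# Conceptual sketch only -- not how Glue implements bookmarks internally.
# A "bookmark" here is the last-modified timestamp committed by the prior run.
def filter_new(files, bookmark):
    """Return only files modified after the bookmarked timestamp."""
    return [f for f in files if f["last_modified"] > bookmark]

files = [
    {"key": "a.json", "last_modified": 100},
    {"key": "b.json", "last_modified": 200},
]
bookmark = 100  # state "committed" after the previous run
new_files = filter_new(files, bookmark)
print([f["key"] for f in new_files])  # ['b.json'] -- a.json is skipped
```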

--job-bookmark-option job-bookmark-enable
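Besides setting this in the job definition, the same option can be passed per run as a job argument via the StartJobRun API (the other valid values are job-bookmark-disable and job-bookmark-pause). A minimal boto3 sketch, where the job name is a placeholder and the actual call (which requires AWS credentials) is left commented out:

```python
# Per-run bookmark option passed as a Glue job argument.
run_args = {"--job-bookmark-option": "job-bookmark-enable"}

# "myjob" is a placeholder; uncomment to actually start a run:
# import boto3
# glue = boto3.client("glue")
# glue.start_job_run(JobName="myjob", Arguments=run_args)
```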

In my case the job had the bookmark option enabled, and I was properly setting the transformation_ctx argument when creating a Glue DynamicFrameReader in the Python script. However, after running the job it would report as "succeeded", yet when viewing the details of the job run I would see the error message above.

In short, what I was missing (which frankly was easy to miss in the docs) was committing the job state via job.commit(). Once I changed my job code as follows, things started working as expected:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# your Spark code here....
dynamic_frame = glueContext.create_dynamic_frame_from_catalog(
    database="my-catalouged-s3-db",
    table_name="mytable",
    push_down_predicate=my_pushdown_predicate,
    transformation_ctx="my-bookmark-name")
...
# this will commit any Glue job bookmark info
job.commit()

After integrating the job.commit() statement, the bookmarking functionality started working as expected: on the next run, the DynamicFrameReader only reads data added to the source since the last successful commit. Short post, but I hope it helps someone else!
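As a related tip: if you ever need to force a full reprocess, the Glue API exposes a ResetJobBookmark operation (and a GetJobBookmark operation for inspecting the stored state). A boto3 sketch, where "myjob" is a placeholder and the calls themselves (which require AWS credentials) are commented out:

```python
# Parameters for clearing a job's bookmark state (placeholder job name).
reset_params = {"JobName": "myjob"}

# Uncomment against a real job to reset or inspect bookmark state:
# import boto3
# glue = boto3.client("glue")
# glue.reset_job_bookmark(**reset_params)          # forces a full reprocess
# entry = glue.get_job_bookmark(JobName="myjob")   # returns the stored bookmark entry
```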

What would be nice to know is what specifically AWS stores, and where, with regard to bookmark info on every commit. Is this stored in S3? Somewhere internally in the Hive catalog? If anyone knows, please post a comment.

Originally published at http://bitsofinfo.wordpress.com on April 4, 2021.

stream of engineering: https://github.com/bitsofinfo