
Streamlining Smartsheet with smartsheet_utils: A pandas-First Python Wrapper

A team in the broader org had spun up their financial and operational tracking on Smartsheet. Smartsheet is the right tool for that kind of data — tracking, operational, annotation-heavy rows that are genuinely spreadsheet-shaped, not warehouse-shaped. A proper database or a dedicated financial-software platform would have been the wrong shape for what they needed: you don't want a finance team managing a Postgres schema to track projects in flight. Smartsheet is fast to stand up, pleasant to use as a non-engineer, and unblocks getting answers to leadership this week instead of next quarter. For their purposes, it was a reasonable call.

What didn't work was the connection to the data scientists and analysts downstream. Their data lived in Smartsheet. Our reporting lived in pandas, Snowflake, and Tableau. The plumbing between the two was a manual export button that nobody had automated. The Smartsheet Python SDK itself is a perfectly capable piece of software — it's object-oriented where most of my work is DataFrame-shaped, but that's a thin wrapping job, not a fundamental mismatch. Once the wrapper exists, Smartsheet is, frankly, an insanely friendly data source for DataFrame-centric workflows. The gap wasn't the SDK. The gap was that nobody on the analytics side had built the wrapper.

This post is about the wrapper I built to close that gap, the gnarly parts of the Smartsheet API I learned about in the process, and the specific pattern that lets a bilingual R / Python analytics team plug into a Smartsheet-native data source without either language feeling second-class.

The asymmetry that motivated the whole exercise

Here's what tipped me over the edge: the R community has had a dedicated Smartsheet package — smartsheetr — for years. Pull a sheet in R and you get a tibble. One line, and your data is already shaped the way you actually work with it.

Python didn't have that. Python had the official Smartsheet SDK, which works, and which returns Sheet objects composed of Row objects composed of Cell objects that reference Columns by opaque integer ID. If you're used to pandas, this feels like working with a database through a punch card. If your team is half-R and half-Python — and every serious analytics team I know is — you end up with the R side breezing through Smartsheet work and the Python side either writing increasingly unhinged one-offs or falling back on CSV exports.

That asymmetry ossifies into team dynamics. The R analysts get assigned the Smartsheet work because it's easier for them; the Python engineers stop engaging with it; the institutional knowledge about the Smartsheet-backed sources starts living in one language instead of being shared across both. That's an organizational smell more than a technical one.

The fix is a DataFrame-first Python wrapper. Read a sheet, get a DataFrame. Write a DataFrame, get row IDs back. Nothing in the signature of a single function should expose the fact that Smartsheet's underlying model is object-oriented. That's the whole pitch.

The mental-model mismatch

Smartsheet is not a database. Smartsheet is a spreadsheet-plus that happens to expose an API. That distinction matters more than it sounds.

In the SDK's model, a sheet is an object. That object contains a list of column objects and a list of row objects. Each row object is itself a container for cell objects. Each cell references its column by ID, not by title — because in Smartsheet, columns can be renamed without breaking references.

From a data-model purity standpoint that's fine. From an "I want to filter rows where status is complete" standpoint, it means you cannot do:

sheet = client.Sheets.get_sheet(sheet_id)
complete = [r for r in sheet.rows if r.status == "complete"]  # does not work

Because there is no r.status. There's r.cells, which is a list, and you have to walk it looking for a cell whose column_id matches the ID of the column whose title you think is status. Which means you need to have already looked up the column ID. Which means the naive one-liner becomes:

sheet = client.Sheets.get_sheet(sheet_id)
status_col_id = next(c.id for c in sheet.columns if c.title == "status")
complete = [
    r for r in sheet.rows
    if next((cell.value for cell in r.cells if cell.column_id == status_col_id), None) == "complete"
]

That is more code than a filter-by-column-name should ever take to write. The whole wrapper is essentially about making sure nobody writes it twice.

The round-trip: read_sheet and dataframe_to_rows

The center of the wrapper is a pair of functions. One reads a sheet and gives you back a DataFrame with column titles as column names and a row_id column carrying Smartsheet's row-identity integer. The other takes a DataFrame and posts it back as rows, resolving the title-to-ID mapping internally.

def read_sheet(sheet_id: int) -> pd.DataFrame:
    sheet = get_sheet(sheet_id)
    return sheet_to_dataframe(sheet)

def sheet_to_dataframe(sheet) -> pd.DataFrame:
    col_map = {col.id: col.title for col in sheet.columns}
    records = []
    for row in sheet.rows:
        record = {"row_id": row.id}
        for cell in row.cells:
            title = col_map.get(cell.column_id, str(cell.column_id))
            record[title] = cell.value
        records.append(record)
    return pd.DataFrame(records)
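
Both excerpts lean on a pair of small helpers, get_client and get_sheet, that the rest of this post keeps reusing. Roughly this shape, assuming token auth through the official SDK; a sketch, not the package's exact code:

import os
import smartsheet

def get_client() -> smartsheet.Smartsheet:
    # one authenticated SDK client; token pulled from the environment
    return smartsheet.Smartsheet(os.environ["SMARTSHEET_ACCESS_TOKEN"])

def get_sheet(sheet_id: int):
    # full sheet fetch: columns, rows, and cells in one call
    return get_client().Sheets.get_sheet(sheet_id)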

And on the write side:

def dataframe_to_rows(sheet_id: int, df: pd.DataFrame) -> list[int]:
    """Append an entire DataFrame as new rows. Returns the new row IDs."""
    records = df.to_dict(orient="records")
    return add_rows(sheet_id, records)

def add_rows(sheet_id: int, data: list[dict]) -> list[int]:
    client = get_client()
    sheet = get_sheet(sheet_id)
    col_map = {col.title: col.id for col in sheet.columns}

    new_rows = []
    for record in data:
        row = smartsheet.models.Row()
        row.to_bottom = True
        for title, value in record.items():
            if title not in col_map:
                raise ValueError(f"Column not found: {title}")
            cell = smartsheet.models.Cell()
            cell.column_id = col_map[title]
            cell.value = value
            row.cells.append(cell)
        new_rows.append(row)

    response = client.Sheets.add_rows(sheet_id, new_rows)
    return [r.id for r in response.data]

Everything the wrapper does for reading and writing ordinary sheet data is a variation on those two functions. The filter you couldn't write earlier becomes:

df = read_sheet(sheet_id)
complete = df[df["status"] == "complete"]

Which is what you wanted in the first place.

The column-ID indirection problem

The add_rows function above does something subtle that's worth spelling out: it resolves a title → column-ID map inside each call, meaning every write pays the cost of one get_sheet round-trip. For a single 10-row insert that's fine. For an ETL that's inserting 10,000 rows in batches of 500, you do not want to re-fetch the sheet metadata twenty times.

The optimization, which production code should use, is to resolve the column map once and thread it through:

def _resolve_columns(sheet) -> dict[str, int]:
    return {col.title: col.id for col in sheet.columns}

def add_rows_bulk(sheet_id: int, data: list[dict]) -> list[int]:
    client = get_client()
    sheet = get_sheet(sheet_id)   # one metadata fetch for the whole job
    col_map = _resolve_columns(sheet)
    # ... same cell-building loop as add_rows, reusing col_map for every batch
    # (filled in under "Batching and rate limits" below) ...

The other option is to cache the column map at the client level, keyed by sheet_id. I didn't go that route. Sheet schemas can change under you — someone renames a column in the Smartsheet UI mid-ETL — and the cost of a stale cache is silent data corruption: writes go to the wrong column, and the SDK will not necessarily tell you. One get_sheet per ETL job is cheap insurance.

The row-ID opaque-integer problem

Updates require row IDs. Smartsheet returns row IDs whenever you read a sheet or insert rows, but if your DataFrame doesn't carry those IDs, you can't update anything — you'd have to re-read the sheet, match by some natural key, and figure out which row ID corresponds to which DataFrame row.

The wrapper solves this by having sheet_to_dataframe emit row_id as the first column, and update_rows expect it back:

def update_rows(sheet_id: int, updates: list[dict]) -> None:
    client = get_client()
    sheet = get_sheet(sheet_id)
    col_map = _resolve_columns(sheet)

    rows_to_update = []
    for record in updates:
        row = smartsheet.models.Row()
        row.id = record["row_id"]
        for title, value in record.items():
            if title == "row_id":
                continue
            if title not in col_map:
                raise ValueError(f"Column not found: {title}")
            cell = smartsheet.models.Cell()
            cell.column_id = col_map[title]
            cell.value = value
            row.cells.append(cell)
        rows_to_update.append(row)

    client.Sheets.update_rows(sheet_id, rows_to_update)

Usage pattern, end to end:

df = read_sheet(sheet_id)
df.loc[df["status"] == "pending", "status"] = "in_review"
update_rows(sheet_id, df[df["status"] == "in_review"].to_dict(orient="records"))

Three lines. Read. Modify. Write back. The row_id column travels with the DataFrame throughout, and nobody downstream ever has to think about it.

The column-type problem

Smartsheet columns have a type. Not a pandas dtype — a Smartsheet type. The enum values include TEXT_NUMBER, DATE, DATETIME, CONTACT_LIST, CHECKBOX, PICKLIST, DURATION, ABSTRACT_DATETIME, and a few others.

Pandas dtypes are not a superset of those, and Smartsheet types are not a superset of pandas dtypes. You will at some point write an integer into a TEXT_NUMBER column and wonder why the display is 42.0 instead of 42. You will write a pandas Timestamp into a DATE column and wonder why it got interpreted as a datetime with a phantom 00:00:00. You will try to write a Python True into a CHECKBOX column and it will work — but writing the string "True" will also work, and they are not the same thing, and the API docs do not tell you which you want.

My pragmatic take, after fighting with this: when you create sheets programmatically, set Smartsheet column types explicitly and stick to them. When you write data, coerce on the Python side to match — int() for numbers going into TEXT_NUMBER, pd.Timestamp.date() for DATE, actual bool for CHECKBOX. Don't trust the SDK to do the right thing across the boundary.
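
A sketch of what that coercion layer can look like. This is a hypothetical helper, not part of the wrapper's public surface, and it assumes DATE cells want plain ISO yyyy-mm-dd strings:

import pandas as pd

def coerce_for_column(value, smartsheet_type: str):
    # hypothetical coercion helper: shape a Python/pandas value for a Smartsheet column type
    if pd.isna(value):
        return None
    if smartsheet_type == "CHECKBOX":
        return bool(value)  # a real bool, never the string "True"
    if smartsheet_type == "DATE":
        return pd.Timestamp(value).date().isoformat()  # date only, no phantom 00:00:00
    if smartsheet_type == "TEXT_NUMBER" and isinstance(value, float) and value.is_integer():
        return int(value)  # 42, not 42.0
    return value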

The create_sheet function in the wrapper punts to Smartsheet's Passthrough API because the high-level SDK doesn't always expose every column-type option cleanly:

def create_sheet(name: str, columns: list[dict[str, str | bool]]) -> int:
    client = get_client()
    response = client.Passthrough.post(
        "/sheets",
        {"name": name, "columns": columns},
    )
    return int(response.to_dict()["result"]["id"])

And the columns list is passed in full JSON shape, matching the REST API docs:

columns = [
    {"title": "Description", "type": "TEXT_NUMBER", "primary": True},
    {"title": "Status", "type": "PICKLIST", "options": ["pending", "in_review", "complete"]},
    {"title": "Due Date", "type": "DATE"},
    {"title": "Done", "type": "CHECKBOX"},
]
sheet_id = create_sheet("My Tracker", columns)

It's the one place where the wrapper deliberately stays thin. Exposing the full column-type expressiveness is worth more than hiding it behind another abstraction.

Primary columns and formula vs. value

Two more Smartsheet-isms that will save you an afternoon.

Every sheet has exactly one primary column. You set it via primary: True at create time. You cannot change it later without recreating the sheet. The primary column is what shows up in cross-sheet references as the row's identity, and what Smartsheet treats as the "natural label" of a row. Pick it carefully. I default to a description or business-key field, not an auto-generated ID.

Cells have both a value and a formula. If a cell has a formula, cell.value gives you the current evaluated value and cell.formula gives you the formula string. When reading a sheet, I return value. When updating a sheet, never write to a formula-backed cell — the API will accept it silently and then Smartsheet will re-evaluate the formula on the next calculation pass and overwrite your value. The right move is to detect formula cells upstream and skip them.

def sheet_to_dataframe(sheet) -> pd.DataFrame:
    col_map = {col.id: col.title for col in sheet.columns}
    records = []
    for row in sheet.rows:
        record = {"row_id": row.id}
        for cell in row.cells:
            title = col_map.get(cell.column_id, str(cell.column_id))
            # surface formula separately so downstream code can decide
            if cell.formula:
                record[f"{title}__formula"] = cell.formula
            record[title] = cell.value
        records.append(record)
    return pd.DataFrame(records)

That __formula suffix convention lets you pull formulas out when you need them (for diffing sheets across environments, for example) while leaving the default read path clean.
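
The same convention gives you a cheap guard on the write side: drop formula-backed columns (and their __formula companions) from a DataFrame before handing it to update_rows. A minimal sketch:

import pandas as pd

def drop_formula_columns(df: pd.DataFrame) -> pd.DataFrame:
    # columns that came back with a __formula companion are formula-backed in the sheet;
    # skip writing to them, and drop the companion columns themselves too
    formula_backed = {c[: -len("__formula")] for c in df.columns if c.endswith("__formula")}
    to_drop = [c for c in df.columns if c.endswith("__formula") or c in formula_backed]
    return df.drop(columns=to_drop)

update_rows(sheet_id, drop_formula_columns(df).to_dict(orient="records"))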

Batching and rate limits

The Smartsheet API is not especially fast, and it has a hard ceiling around 450 rows per write call. If you try to post 500 rows at once, the API rejects the request outright rather than doing what you meant. This is a frequent trip-up during the first ETL that scales past a few hundred rows.

The wrapper batches deletes at 400 rows per call to stay safely under the limit:

def delete_rows(sheet_id: int, row_ids: list[int], batch_size: int = 400) -> None:
    client = get_client()
    for i in range(0, len(row_ids), batch_size):
        batch = row_ids[i : i + batch_size]
        client.Sheets.delete_rows(sheet_id, batch)

add_rows batches the same way in production, and it builds and validates every Row object before firing the first API call, so a bad column title fails the whole job up front instead of leaving the sheet half-inserted.
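
A sketch of what the add_rows_bulk stub from earlier looks like when filled in along those lines, reusing the helpers above; assumptions about the production code, not a copy of it:

def add_rows_bulk(sheet_id: int, data: list[dict], batch_size: int = 400) -> list[int]:
    client = get_client()
    col_map = _resolve_columns(get_sheet(sheet_id))  # one metadata fetch for the whole job

    # build, and validate, every Row before the first API call
    rows = []
    for record in data:
        row = smartsheet.models.Row()
        row.to_bottom = True
        for title, value in record.items():
            if title not in col_map:
                raise ValueError(f"Column not found: {title}")
            cell = smartsheet.models.Cell()
            cell.column_id = col_map[title]
            cell.value = value
            row.cells.append(cell)
        rows.append(row)

    # only then write, in batches under the per-call row ceiling
    new_ids: list[int] = []
    for i in range(0, len(rows), batch_size):
        response = client.Sheets.add_rows(sheet_id, rows[i : i + batch_size])
        new_ids.extend(r.id for r in response.data)
    return new_ids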

Rate-limit-wise, Smartsheet's quotas are per-user and per-minute, and the API returns 429 with a Retry-After header when you're over. The wrapper doesn't bake in a retry decorator because different callers want different retry semantics (ETLs want retries, interactive scripts want to fail fast). The SDK's internal retry logic is adequate for most use cases; the exception is bulk writes, where you want application-level backoff to avoid wasting the API budget on repeated requests that are all going to get throttled.
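
For the bulk-write case, application-level backoff can be as small as a wrapper around each batch call. A sketch; the throttle check is left to the caller because the exact exception shape depends on the SDK version you're pinned to:

import random
import time

def with_backoff(call, is_throttle, max_retries: int = 5, base_delay: float = 2.0):
    # call: zero-argument callable that performs one API request (e.g. one add_rows batch)
    # is_throttle: predicate deciding whether an exception is a 429 worth waiting out
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            if attempt == max_retries - 1 or not is_throttle(exc):
                raise
            # exponential backoff with jitter; honor Retry-After upstream if you surface it
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))

# per batch, with a hypothetical looks_like_429 predicate that inspects the SDK's ApiError:
# with_backoff(lambda: client.Sheets.add_rows(sheet_id, batch), is_throttle=looks_like_429)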

Discovery: finding the sheet you want

A working Smartsheet environment easily ends up with hundreds of sheets, scattered across a few dozen workspaces and an unknowable number of personal folders. The Smartsheet search UI is passable for interactive use. For scripts, you want the search primitives surfaced as DataFrames.

list_sheets()               # all sheets the token can see, most-recent first
list_workspaces()           # all workspaces the token can access
list_folders(workspace_id)  # folders within a workspace (or top-level)
search(query, scopes=["sheetNames"])  # search by name with scope filter
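
Under the hood each of these is a thin projection of an SDK list endpoint into a DataFrame. Roughly this, for list_sheets, assuming the same get_client helper; the column names are my choice for the sketch, not a contract:

import pandas as pd

def list_sheets() -> pd.DataFrame:
    result = get_client().Sheets.list_sheets(include_all=True)
    return (
        pd.DataFrame(
            [{"id": s.id, "name": s.name, "modified_at": s.modified_at} for s in result.data]
        )
        .sort_values("modified_at", ascending=False)
        .reset_index(drop=True)
    )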

Each returns a DataFrame. Because they're DataFrames, you can filter, join, and export them with zero ceremony:

sheets = list_sheets()
q4_trackers = sheets[sheets["name"].str.contains("Q4", case=False)]
q4_trackers.to_csv("q4_sheet_inventory.csv", index=False)

When someone hands you a new environment with "a couple hundred" sheets and asks you to figure out what's what, this is the first thing you run.

The ETL loop

The end-to-end ETL pattern for a Smartsheet-backed data source looks roughly like this:

from smartsheet_utils import read_sheet_by_name, update_rows, add_rows

# pull Smartsheet side as a DataFrame
df_sheet = read_sheet_by_name("Q4 Budget Tracker")
# the write-backs below are keyed by the numeric sheet ID; resolve it once up front,
# e.g. from the list_sheets() / search() helpers shown earlier
sheet_id = ...  # resolved ID for "Q4 Budget Tracker"

# pull upstream source (a warehouse table, via a role-routed connector)
df_source = query_to_df("""
    SELECT project_code, forecast_spend, actual_spend, updated_at
    FROM analytics.finance.project_spend_weekly
    WHERE week = CURRENT_DATE - 7
""")

# match on business key
merged = df_sheet.merge(
    df_source,
    left_on="Project Code",
    right_on="project_code",
    how="outer",
    indicator=True,
)

# existing rows → update
updates = (
    merged[merged["_merge"] == "both"]
    [["row_id", "Project Code", "forecast_spend", "actual_spend"]]
    .rename(columns={"forecast_spend": "Forecast", "actual_spend": "Actual"})
    # the outer merge floats row_id (NaNs from the right_only rows); cast back to int for the API
    .astype({"row_id": "int64"})
)
update_rows(sheet_id, updates.to_dict(orient="records"))

# new rows → insert
new = (
    merged[merged["_merge"] == "right_only"]
    [["project_code", "forecast_spend", "actual_spend"]]
    .rename(columns={
        "project_code": "Project Code",
        "forecast_spend": "Forecast",
        "actual_spend": "Actual",
    })
)
add_rows(sheet_id, new.to_dict(orient="records"))

Nothing here is clever. That's the point. The DataFrame-centric flow means every step is a merge, filter, rename, or to_dict — pandas operations, not Smartsheet operations. The same code structure ports straight across to R by swapping merge for a dplyr::full_join and to_dict(orient="records") for the smartsheetr row format. Bilingual teams share patterns, not implementations.

Stale caches and other boring disasters

The quiet failure mode: someone renames a column in the Smartsheet UI between your ETL runs. The title → ID map you built at the start no longer matches reality. Writes targeted at "Forecast" go to a column that used to be "Forecast" and is now "Forecast (Revised)" — except they don't, because the ID resolver raises ValueError: Column not found.

That ValueError is a feature. It is infinitely better than a silent write to the wrong column. My contract with every ETL that writes to Smartsheet is: if a column rename breaks you, fail loudly and fail early. The ETL handles the failure by emailing the Smartsheet owner with a diff of what the column titles used to be and what they are now, and leaves the sheet untouched until a human decides whether the script or the sheet schema is the source of truth.

Related: if you build dashboards on top of the Smartsheet data, put a schema-validation step in front of them too. Fail the dashboard refresh loudly if the expected columns aren't where you left them. The worst place to discover a renamed column is a stakeholder staring at an executive dashboard and wondering why the numbers look wrong.
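
The guard itself is a few lines. A hypothetical assert_schema helper, reusing get_sheet from earlier, run at the top of every refresh:

def assert_schema(sheet_id: int, expected: set[str]) -> None:
    # fail the refresh loudly if expected column titles have drifted
    actual = {col.title for col in get_sheet(sheet_id).columns}
    missing = expected - actual
    if missing:
        raise RuntimeError(f"Schema drift on sheet {sheet_id}: missing columns {sorted(missing)}")

assert_schema(sheet_id, {"Project Code", "Forecast", "Actual"})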

Sharing as code

One more thing the wrapper makes easy that I didn't expect to find valuable: programmatic access audits.

list_sheet_shares(sheet_id)          # who has access to this sheet
find_user_access("alice@example.com")  # every workspace alice can see
list_groups()                         # every group the token can see
find_my_groups("alice@example.com")    # every group alice is in

When a new analyst joins, instead of asking them "do you have access to X?" I can run an access audit across the sheets they'll be working with and know in five minutes whether the answer is yes, no, or partial. When someone leaves, I can generate the off-boarding delta — every workspace they're still a member of — and hand it to the Smartsheet admin as a checklist.
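
The onboarding audit is just a loop over the sheets the new analyst will touch. A sketch, assuming list_sheet_shares returns a DataFrame with an email column, which is an assumption about the wrapper's output shape rather than a documented one:

import pandas as pd

def audit_access(sheet_ids: list[int], email: str) -> pd.DataFrame:
    # one row per sheet: does this person appear in its share list?
    rows = []
    for sid in sheet_ids:
        shares = list_sheet_shares(sid)
        rows.append({"sheet_id": sid, "has_access": email in set(shares["email"])})
    return pd.DataFrame(rows)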

Not glamorous, but the kind of thing that stops access management from becoming an ambient source of analyst frustration.

A note on branded output

The full package includes a write_wotc_smartsheet helper that produces branded, formatted sheets for internal reporting — header rows, a fixed color palette, logo assets, currency formatting applied to the right columns automatically. I'm not covering it in this post because it's specific to an internal style system, and the interesting material for a general audience is the generic pattern.

If you want to extend the wrapper with your own branding layer, the relevant hooks are Smartsheet's 17-position format descriptor (which is its own special hell and is documented in the public Smartsheet API reference), and client.Attachments.attach_file_to_sheet() for logo images. That's a different post.

Connecting the team

The thing the wrapper did that I didn't fully appreciate until it was in place: it made the Smartsheet-native team's work a first-class data source in our reporting stack, rather than an isolated island that required a manual bridge to cross.

Before: analysts who wanted to include Smartsheet data in their reporting had to ask someone to export it to CSV, or they wrote one-off scripts that barely worked. Dashboards that pulled from Smartsheet either didn't exist or were manual refresh-and-paste jobs. ETLs that should have been automated were running in somebody's head once a week. The information asymmetry between the Smartsheet-native team and the pandas-native team cost real coordination time, and the fix kept getting deferred because "we'll build it properly later."

After: the Smartsheet data is one read_sheet_by_name() call away. It goes into the same pandas-shaped pipelines as everything else. Dashboards pull from it. ETLs write back to it. The team that owns the sheet doesn't have to change how they work. The team that consumes the data doesn't have to build its reporting around a manual step. Both sides win, and the R half of the org — which already had smartsheetr — now has a Python counterpart that mirrors the same pattern so cross-language collaboration is a question of merge vs. full_join, not "does your language even have a wrapper."

Sometimes the most valuable tooling is the tooling that closes the gap between two teams who had been working around each other instead of with each other. This one closed about six months of accumulated workarounds in a 1,200-line package.

The one-line takeaway

If you're writing Python against Smartsheet and you're not wrapping it in a DataFrame layer, you're writing it wrong. The Smartsheet SDK is a good piece of software. It is also the wrong interface for analysts. Put a read_sheet_by_name and a dataframe_to_rows in front of it, keep the object model out of your pipelines, and you get to pretend Smartsheet is just another tabular data source — which, for your purposes, it is.