Hi all,
I am looking for advice on working with complex Excel models from Python.
The Excel files I deal with have multiple sheets, many cross-sheet references, and a lot of conditional logic. What I would like to do is fairly simple in theory: programmatically change some input values and then retrieve the recalculated output values.
In practice, recalculation and compatibility with certain Excel functions become problematic when the model is driven externally.
For those who have worked with similar setups:
Do you keep Excel as the calculation engine, or do you usually port the logic to Python?
Are there tools or patterns that worked well for you?
At what point do you decide an Excel model should be reworked outside Excel?
I am mainly interested in best practices and real-world experiences.
Thanks.
It really depends on a lot of factors, but from my experience it's almost always easiest to just load the Excel file(s) in Python, recalculate in Python, and write the results back out (often to a new file, so you don't accidentally overwrite the original).
Dealing with Excel files in Python is a pita. You end up writing Python code that reproduces Excel operations, at which point it's probably easier to just do it in VBA. Your main problems will be how you access the data and dealing with your colleagues' inputs (if they add a column, put the wrong data in the wrong field, or do anything else that breaks your script).
You might even consider skipping those files altogether: write a pipeline that ingests all the data that goes into them, store it in a database, and generate reports from that.
If you want to open and edit an Excel file, openpyxl is your go-to.
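As a minimal sketch of that load/edit/save-to-a-new-file workflow (the file name, sheet name, and cell address here are placeholders; the stand-in workbook is only built so the snippet runs on its own):

```python
from openpyxl import Workbook, load_workbook

# Build a tiny stand-in workbook; in a real case you'd already have the model.
wb = Workbook()
ws = wb.active
ws.title = "Inputs"
ws["B2"] = 10
wb.save("model.xlsx")

# Load the model, change an input cell, and save to a NEW file
# so the original is never accidentally overwritten.
wb = load_workbook("model.xlsx")
wb["Inputs"]["B2"] = 42
wb.save("model_modified.xlsx")

result = load_workbook("model_modified.xlsx")["Inputs"]["B2"].value
```

Saving under a different name is cheap insurance: if the script has a bug, the source workbook is untouched.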
Thank you for your answer
I'm stuck because this time it's not so easy... I'm used to working with Excel files, but I never needed the calculated values before... I always just read the final values.
This time, I need to load the file, change some specific values, recalculate, and then read the resulting values... The Excel file has a lot of internal references, so if you change a value in sheet A, it affects values in sheet F, which in turn pull values from other sheets or even run new calculations based on the change.
This sounds like someone built a pseudo-db in excel, which has evolved into a monstrosity over a decade.
You should consider doing a complete overhaul of this system. Is there an Access database behind it? Are VBA scripts involved? How many different sources are pooled from how many sheets?
There are 27 sheets in the excel file hahahah
And it doesn't access any database; all the values are in the Excel file, and there's no VBA either.
It's really hard to understand what's happening and where the values come from.
And the original data: is it manually entered in one sheet and then passed through 10 other sheets?
Yes, the data is manually entered in each specific sheet… And those values are used to generate the "final values", which are calculated from some inputs;
I think this image gives you a general idea of what I'm saying:
https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd.it%2Fbest-way-to-work-with-complex-excel-models-from-python-v0-at5qqku4zccg1.png%3Fwidth%3D1029%26format%3Dpng%26auto%3Dwebp%26s%3Dfbad33aa0e6c9c2d216bb84f361c4ba782c4fb2b
OK.
This seems quite easy because it looks structured.
With openpyxl you can insert values by coordinates into cells on specific sheets. Note that openpyxl is 1-indexed: the top-left cell A1 is cell(row=1, column=1). This way you can write several steps that insert values into specific cells on specific sheets, then retrieve the results from the final sheet with ws.cell(row=r, column=c).value.
This will ofc break if someone else has the Excel file open at any point.
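A sketch of that coordinate-based pattern (sheet names and addresses are invented); the key point is that cell() takes 1-indexed row/column arguments, and that openpyxl hands formulas back as plain strings:

```python
from openpyxl import Workbook

wb = Workbook()
inputs = wb.active
inputs.title = "SheetA"
final = wb.create_sheet("Final")

# Write an input by coordinates: row=1, column=1 is cell A1 (1-indexed).
inputs.cell(row=1, column=1, value=123)

# A formula on the final sheet referencing the input sheet.
final["A1"] = "=SheetA!A1*2"

# Reading it back gives the formula string, not a computed value,
# because openpyxl does not evaluate formulas.
text = final.cell(row=1, column=1).value
```

So this pattern covers the writing side; getting recalculated values needs something else on top.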
Hummm... But the issue is that when I read the file using openpyxl and try to read cell E9, for instance, it returns either the formula string or None, because the formula was never evaluated.
Yeah, openpyxl doesn't evaluate formulas. It can give you the formula as a string, or (with data_only=True) the value Excel cached at the last save, but that's about it.
For the most part, Excel imports using modules like Pandas or openpyxl work a lot like glorified CSV imports. For the data munging most people are likely to do, that's all you're going to need.
If you need your Excel formulas actually evaluated, a quick Google search turns up formulas, xlcalculator, and pycel as options for running them within your Python code.
There's no definitive answer anyone can give you. It all depends on what you're trying to achieve. Maybe you'll need to reimplement everything from scratch using NumPy and psycopg. Maybe NumPy, pandas, and scikit-learn are enough for some one-off analytics.
Thanks, man!
Yes, as a data scientist I'm used to reading .csv or even .xlsx files to get some values and do other things with them,
but in this case I need to load the file, change some specific values, recalculate, and then read the resulting values... The Excel file has a lot of internal references, so if you change a value in sheet A, it affects values in sheet F, which in turn pull values from other sheets or even run new calculations based on the change.
Well, I came across this little reference, covering a few different Python modules for Excel. I haven't used it myself, but based on what you're describing, Koala might be up your alley.
Thx man...
I've tried it, but as soon as I started with the simplest things I noticed it was using very old dependencies... After checking the GitHub repo, it seems the last update to Koala was 7 years ago.
Almost the same for pycel (4 years ago).
Old dependencies don't necessarily mean it's bad. You just need to set up your venv accordingly.
Yeah. References are a nightmare. It's like the Omniman meme where the fighter jets are recreating a fraction of the power of pandas' map and apply endofunctors. Sure, it's easier for a regular person to wrap their heads around, but at the end of the day, wildly inefficient. If you're committed to leaving the internal workings to the spreadsheet, you're going to need to use pyautogui on it.
Once you get to this point they're probably not using Excel correctly any more. The power of Excel is being able to eyeball your data and that becomes impractical once you have a certain volume and number of tabs. It also becomes more risky with copy/pasting etc.
I'd be looking to move things to Python or maybe KNIME since that should allow business users to stay in charge of their own data without them baulking at the idea of learning a programming language.
Excel is kind of a spreadsheet that wishes it was a relational database. Excel jockeys keep flogging pivot tables like Catholic penitents at a popular march because that's basically the only way to extract any useful insight from a bloated spreadsheet.
The problem is that it's just capable enough to do most things you need and it means that both businesses and people don't invest in other options. I was great at Excel, but once I got my head around R I realised how limited and awful it is. Most people don't get the chance to do that and go on to create abominations and they pat themselves on the back for making it work.
A few weeks ago someone asked me to fix their Excel workbook which was made to plot names in a sheet based on some scores. It was packed with multi-line formulae to deal with the logic of placing and ordering text drawn from the input table and took about three sheets to do it. It was tragic to see the time and the skill that went into using the wrong tool for the job because it was the only tool they had or knew how to use for the job.
And the even more tragic thing is that in most cases your awesome platform/system is going to produce an Excel file for someone to deal with. Because that's how most business operates.
Absolutely. Most of my stuff outputs a CSV because the people and the tech aren't there to use anything better. It would make sense to output to BI, or dynamic reports or presentations, but mostly the IT doesn't exist to support it, and when it does, it's a nanometer outside people's comfort zone, which is too far for most.
Have you tried xlwings? It is available both as a Python package and as an Excel add-in.
With the Python package, Python communicates with a running Excel instance and lets Excel read, process, and save the files instead of doing file operations on its own.
So from python, you can load the file into Excel (with all contents also available to python), change the inputs in the Excel file, let Excel do the recalculation, read the results into python if you need that, and let Excel write the file.
I use it a lot. In my experience, reading and writing large Excel files to/from Python this way is actually faster than with openpyxl and the other packages I have used.
I haven't tried xlwings, but a good reason to go with openpyxl or fastexcel is polars.
As I use Linux (and the servers run Linux too), xlwings is useless in my case.
It seems xlwings only works on Windows and macOS, since it needs an installed copy of Excel to drive.
If all else fails, maybe you can load it in LibreOffice Calc and manually export a CSV with the specific fields you're interested in.
It doesn't work properly in LibreOffice either, man hehehehe
To get at the calculated numbers, I had to load the file into https://excel.cloud.microsoft/