Hi all,
I am looking for advice on working with complex Excel models from Python.
The Excel files I deal with have multiple sheets, many cross-sheet references, and a lot of conditional logic. What I would like to do is fairly simple in theory: programmatically change some input values and then retrieve the recalculated output values.
In practice, recalculation and compatibility with certain Excel functions become problematic when the model is driven externally.
For those who have worked with similar setups:
Do you keep Excel as the calculation engine, or do you usually port the logic to Python?
Are there tools or patterns that worked well for you?
At what point do you decide an Excel model should be reworked outside Excel?
I am mainly interested in best practices and real-world experiences.
Thanks.
It really depends on a lot of factors, but from my experience it's almost always easiest to just load the Excel file(s) in Python, recalculate in Python, and write the results back out (often to a new file, so you don't accidentally overwrite the original).
Dealing with Excel files in Python is a pita. You end up writing Python code that reproduces Excel operations, at which point it's probably easier to just do it in VBA. Your main problems will be how you access the data and dealing with your colleagues' inputs (if they add a column, put the wrong data in the wrong field, or do anything else that breaks your script).
You might even consider skipping those files altogether: write a pipeline that ingests all the data that goes into them, store it in a database, and generate reports from that.
If you want to open and edit an Excel file, openpyxl is your go-to.
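As a minimal sketch of that load/edit/save-to-a-new-file workflow (the file name, sheet name, and cell address here are placeholders; the stand-in workbook is only built so the snippet runs on its own):

```python
from openpyxl import Workbook, load_workbook

# Build a tiny stand-in workbook; in a real case you'd already have the model.
wb = Workbook()
ws = wb.active
ws.title = "Inputs"
ws["B2"] = 10
wb.save("model.xlsx")

# Load the model, change an input cell, and save to a NEW file
# so the original is never accidentally overwritten.
wb = load_workbook("model.xlsx")
wb["Inputs"]["B2"] = 42
wb.save("model_modified.xlsx")

result = load_workbook("model_modified.xlsx")["Inputs"]["B2"].value
```

Saving under a different name is cheap insurance: if the script has a bug, the source workbook is untouched.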
Thank you for your answer
I'm stuck because this time it's not so easy... I'm used to working with Excel files, but I never needed the calculated values before... I always just read the final values.
This time, I need to load the file, change some specific values, recalculate, and then read the resulting values... The Excel file has a lot of internal references, so if you change a value in sheet A, it affects values in sheet F, which in turn pull values from other sheets or even run new calculations based on the change.
This sounds like someone built a pseudo-db in excel, which has evolved into a monstrosity over a decade.
You should consider doing a complete overhaul of this system. Is there an Access database behind it? Are VBA scripts involved? How many different sources are pooled from how many sheets?
There are 27 sheets in the excel file hahahah
And it doesn't access any database; all the values are in the Excel file, and there's no VBA either.
It's really hard to understand what's happening and where the values come from.
And the original data: is it manually entered in one sheet and then passed through 10 other sheets?
Yes, the data is manually entered in each specific sheet… And those values are used to generate the "final values", which are calculated from some inputs;
I think this image gives you a general idea of what I'm saying:
https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd.it%2Fbest-way-to-work-with-complex-excel-models-from-python-v0-at5qqku4zccg1.png%3Fwidth%3D1029%26format%3Dpng%26auto%3Dwebp%26s%3Dfbad33aa0e6c9c2d216bb84f361c4ba782c4fb2b
OK.
This seems quite easy because it looks structured.
With openpyxl you can insert values by coordinates into cells on specific sheets. Note that openpyxl is 1-indexed: the top-left cell A1 is cell(row=1, column=1). This way you can write several steps that insert values into specific cells on specific sheets, then retrieve the results from the final sheet with ws.cell(row=r, column=c).value.
This will ofc break if someone else has the Excel file open at any point.
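A sketch of that coordinate-based pattern (sheet names and addresses are invented); the key point is that cell() takes 1-indexed row/column arguments, and that openpyxl hands formulas back as plain strings:

```python
from openpyxl import Workbook

wb = Workbook()
inputs = wb.active
inputs.title = "SheetA"
final = wb.create_sheet("Final")

# Write an input by coordinates: row=1, column=1 is cell A1 (1-indexed).
inputs.cell(row=1, column=1, value=123)

# A formula on the final sheet referencing the input sheet.
final["A1"] = "=SheetA!A1*2"

# Reading it back gives the formula string, not a computed value,
# because openpyxl does not evaluate formulas.
text = final.cell(row=1, column=1).value
```

So this pattern covers the writing side; getting recalculated values needs something else on top.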
Hummm... But the issue is that when I read the file using openpyxl and try to read cell E9, for instance, it returns either the formula string or None, because the formula was never evaluated.
Yeah, openpyxl doesn't evaluate formulas. It can give you the formula as a string, or (with data_only=True) the value Excel cached at the last save, but that's about it.
For the most part, Excel imports using modules like Pandas or openpyxl work a lot like glorified CSV imports. For the data munging most people are likely to do, that's all you're going to need.
If you need your Excel formulas actually evaluated, a quick Google search turns up formulas, xlcalculator, and pycel as options for running them within your Python code.
There's no definitive answer anyone can give you. It all depends on what you're trying to achieve. Maybe you'll need to reimplement everything from scratch using NumPy and psycopg. Maybe NumPy, pandas, and scikit-learn are enough for some one-off analytics.
Thanks, man!
Yes, as a data scientist I'm used to reading .csv or even .xlsx files to get some values and do other things with them,
but in this case I need to load the file, change some specific values, recalculate, and then read the resulting values... The Excel file has a lot of internal references, so if you change a value in sheet A, it affects values in sheet F, which in turn pull values from other sheets or even run new calculations based on the change.
Well, I came across this little reference, covering a few different Python modules for Excel. I haven't used it myself, but based on what you're describing, Koala might be up your alley.
Thx man...
I've tried it, but as soon as I started with the simplest things I noticed it was using very old dependencies... After checking the GitHub repo, it seems the last update to Koala was 7 years ago.
Almost the same for pycel (4 years ago).
Old dependencies don't necessarily mean it's bad. You just need to set up your venv accordingly.
Yeah. References are a nightmare. It's like the Omniman meme where the fighter jets are recreating a fraction of the power of pandas' map and apply endofunctors. Sure, it's easier for a regular person to wrap their heads around, but at the end of the day, wildly inefficient. If you're committed to leaving the internal workings to the spreadsheet, you're going to need to use pyautogui on it.
Once you get to this point they're probably not using Excel correctly any more. The power of Excel is being able to eyeball your data and that becomes impractical once you have a certain volume and number of tabs. It also becomes more risky with copy/pasting etc.
I'd be looking to move things to Python or maybe KNIME since that should allow business users to stay in charge of their own data without them baulking at the idea of learning a programming language.
Excel is kind of a spreadsheet that wishes it was a relational database. Excel jockeys keep flogging pivot tables like Catholic penitents at a popular march because that's basically the only way to extract any useful insight from a bloated spreadsheet.
The problem is that it's just capable enough to do most things you need and it means that both businesses and people don't invest in other options. I was great at Excel, but once I got my head around R I realised how limited and awful it is. Most people don't get the chance to do that and go on to create abominations and they pat themselves on the back for making it work.
A few weeks ago someone asked me to fix their Excel workbook which was made to plot names in a sheet based on some scores. It was packed with multi-line formulae to deal with the logic of placing and ordering text drawn from the input table and took about three sheets to do it. It was tragic to see the time and the skill that went into using the wrong tool for the job because it was the only tool they had or knew how to use for the job.
And the even more tragic thing is that in most cases your awesome platform/system is going to produce an Excel file for someone to deal with. Because that's how most business operates.
Absolutely. Most of my stuff outputs a CSV because the people and the tech aren't there to use anything better. It would make sense to output to BI, or dynamic reports or presentations, but mostly the IT doesn't exist to support it, and when it does, it's a nanometer outside people's comfort zone, which is too far for most.
Have you tried xlwings? It is available both as a Python package and as an Excel add-in.
With the Python package, Python communicates with a running Excel instance and lets Excel read, process, and save the files instead of doing file operations on its own.
So from python, you can load the file into Excel (with all contents also available to python), change the inputs in the Excel file, let Excel do the recalculation, read the results into python if you need that, and let Excel write the file.
I use it a lot. In my experience, reading and writing large Excel files to/from Python this way is actually faster than with openpyxl and the other packages I have used.
I haven't tried xlwings, but a good reason to go with openpyxl or fastexcel is polars.
As I use Linux (and the servers run Linux too), xlwings is useless in my case.
It seems xlwings only works on Windows and macOS, since it needs an installed copy of Excel to drive.
If all else fails, maybe you can load it in LibreOffice Calc and manually export a CSV with the specific fields you're interested in.
It doesn't work properly in LibreOffice either, man hehehehe
To get at the calculated numbers, I had to load the file into https://excel.cloud.microsoft/